Abstract

In this paper, we establish some theoretical connections between
Sum-Product Networks (SPNs) and Bayesian Networks (BNs). We prove that
every SPN can be converted into a BN in linear time and space in terms of
the network size. The key insight is to use Algebraic Decision Diagrams
(ADDs) to compactly represent the local conditional probability
distributions at each node in the resulting BN by exploiting
context-specific independence (CSI). The generated BN has a simple
directed bipartite graphical structure. We show that by applying the
Variable Elimination algorithm (VE) to the generated BN with ADD
representations, we can recover the original SPN, where the SPN can be
viewed as a history record or caching of the VE inference process. To help
state the proof clearly, we introduce the notion of normal SPN and present
a theoretical analysis of the consistency and decomposability properties.
We conclude the paper with some discussion of the implications of the
proof and establish a connection between the depth of an SPN and a lower
bound of the tree-width of its corresponding BN.

1. Introduction 

Sum-Product Networks (SPNs) have recently been proposed as tractable deep
models (Poon & Domingos, 2011) for probabilistic inference. They
distinguish themselves from other types of probabilistic graphical models
(PGMs), including Bayesian Networks (BNs) and Markov Networks (MNs), by
the fact that inference can be done exactly in linear time with respect to
the size of the network. This has generated a lot of interest since
inference is often a core task for parameter estimation and structure
learning, and it typically needs to be approximated to ensure tractability
since probabilistic inference in BNs and MNs is #P-complete (Roth, 1996).


The relationship between SPNs and BNs, and more broadly with PGMs, is not
clear. Since the introduction of SPNs in the seminal paper of Poon &
Domingos (2011), it is well understood that SPNs and BNs are equally
expressive in the sense that they can represent any joint distribution
over discrete variables¹, but it is not clear how to convert SPNs into
BNs, nor whether a blow up may occur in the conversion process. The common
belief is that there exists a distribution such that the smallest BN that
encodes this distribution is exponentially larger than the smallest SPN
that encodes this same distribution. The key behind this belief lies in
SPNs' ability to exploit context-specific independence (CSI) (Boutilier et
al., 1996).

While the above belief is correct for classic BNs with tabular conditional
probability distributions (CPDs) that ignore CSI, and for BNs with
tree-based CPDs due to the replication problem (Pagallo, 1989), it is not
clear whether it is correct for BNs with more compact representations of
the CPDs. The other direction is clear for classic BNs with tabular
representation; given a BN with tabular representation of its CPDs, we can
build an SPN that represents the same joint probability distribution in
time and space complexity that may be exponential in the tree-width of the
BN. Briefly, this is done by first constructing a junction tree and then
translating it into an SPN². However, to the best of our knowledge, it is
still unknown how to convert an SPN into a BN and whether the conversion
will lead to a blow up when more compact representations than tables and
trees are used for the CPDs.

We prove in this paper that by adopting Algebraic Decision Diagrams (ADDs)
(Bahar et al., 1997) to represent the CPDs at each node in a BN, every SPN
can be converted into a BN in linear time and space complexity in the size
of the SPN. The generated BN has a simple bipartite structure, which
facilitates the analysis of the structure of an SPN in terms of the
structure of the generated BN. Furthermore, we show that by applying the
Variable Elimination (VE) algorithm (Zhang & Poole, 1996) to the generated
BN with ADD representation of its CPDs, we can recover the original SPN in
linear time and space with respect to the size of the SPN.

¹ Joint distributions over continuous variables are also possible, but we
will restrict ourselves to discrete variables in this paper.

² http://spn.cs.washington.edu/faq.shtml

Our contributions can be summarized as follows. First, we present a
constructive algorithm and a proof for the conversion of SPNs into BNs
using ADDs to represent the local CPDs. The conversion process is bounded
by a linear function of the size of the SPN in both time and space. This
gives a new perspective to understand the probabilistic semantics implied
by the structure of an SPN through the generated BN. Second, we show that
by executing VE on the generated BN, we can recover the original SPN in
linear time and space complexity in the size of the SPN. Combined with the
first point, this establishes a clear relationship between SPNs and BNs.
Third, we introduce the subclass of normal SPNs and show that every SPN
can be transformed into a normal SPN in quadratic time and space. Compared
with general SPNs, the structure of normal SPNs exhibits more intuitive
probabilistic semantics and hence normal SPNs are used as a bridge in the
conversion of general SPNs to BNs. Fourth, our construction and analysis
provide a new direction for learning the parameters and structure of BNs,
since the SPNs produced by the algorithms that learn SPNs (Dennis &
Ventura, 2012; Gens & Domingos, 2013; Peharz et al., 2013; Rooshenas &
Lowd, 2014) can be converted into BNs.

2. Related Work 

Exact probabilistic reasoning has a close connection with propositional
logic and weighted model counting (Roth, 1996; Gomes et al., 2008; Bacchus
et al., 2003; Sang et al., 2005). The model counting problem, #SAT, is the
problem of computing the number of models for a given propositional
formula, i.e., the number of distinct truth assignments of the variables
for which the formula evaluates to TRUE. In its weighted version, each
Boolean variable $X$ has a weight $\Pr(x) \in [0,1]$ when set to TRUE and
a weight $1 - \Pr(x)$ when set to FALSE. The weight of a truth assignment
is the product of the weights of its literals. The weighted model counting
problem then asks for the sum of the weights of all satisfying truth
assignments. There are two important streams of research for exact
weighted model counting and exact probabilistic reasoning that relate to
SPNs: DPLL-style exhaustive search (Birnbaum & Lozinskii, 2011) and those
based on knowledge compilation, e.g., Binary Decision Diagrams (BDDs),
Decomposable Negation Normal Forms (DNNFs) and Arithmetic Circuits (ACs)
(Bryant, 1986; Darwiche, 2001; 2000).
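The definition of weighted model counting above can be made concrete with a brute-force enumeration. The following is a minimal sketch for illustration only: the function name, the formula `A or B` and the weights are hypothetical, and the enumeration is exponential in the number of variables, unlike the DPLL-style and compilation-based methods cited above.

```python
from itertools import product

def weighted_model_count(variables, weights, formula):
    """Sum the weights of all satisfying assignments of `formula`.

    variables: list of variable names
    weights:   dict name -> Pr(x), the weight of the literal X = TRUE
    formula:   callable taking a dict name -> bool and returning a bool
    """
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            # Weight of an assignment = product of the weights of its literals.
            w = 1.0
            for name in variables:
                w *= weights[name] if assignment[name] else 1.0 - weights[name]
            total += w
    return total

# Hypothetical example: the formula (A or B) with Pr(a) = 0.3 and Pr(b) = 0.5.
wmc = weighted_model_count(["A", "B"], {"A": 0.3, "B": 0.5},
                           lambda s: s["A"] or s["B"])
```

Setting every weight to 0.5 and multiplying the result by $2^n$ recovers the unweighted model count (#SAT) of the formula.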

The SPN, as an inference machine, has a close connection with the broader
field of knowledge representation and knowledge compilation. In knowledge
compilation, the reasoning process is divided into two phases: an offline
compilation phase and an online query-answering phase. In the offline
phase, the knowledge base, either propositional theory or belief network,
is compiled into some tractable target language. In the online phase, the
compiled target model is used to answer a large number of queries
efficiently. The key motivation of knowledge compilation is to shift the
computation that is common to many queries from the online phase into the
offline phase. As an example, ACs have been studied and used extensively
in both knowledge representation and probabilistic inference (Darwiche,
2000; Huang et al., 2006; Chavira et al., 2006). Rooshenas & Lowd (2014)
recently showed that ACs and SPNs can be converted mutually without an
exponential blow-up in both time and space. As a direct result, ACs and
SPNs share the same expressiveness for probabilistic reasoning.

Another representation closely related to SPNs in propositional logic and
knowledge representation is the deterministic-Decomposable Negation Normal
Form (d-DNNF) (Darwiche & Marquis, 2001). Propositional formulas in d-DNNF
are represented by a directed acyclic graph (DAG) structure to enable the
re-usability of sub-formulas. The terminal nodes of the DAG are literals
and the internal nodes are AND or OR operators. Like SPNs, d-DNNF formulas
can be queried to answer satisfiability and model counting problems. We
refer interested readers to Darwiche & Marquis (2001) and Darwiche (2001)
for more detailed discussions.

Since their introduction by Poon & Domingos (2011), SPNs have generated a
lot of interest as a tractable class of models for probabilistic inference
in machine learning. Discriminative learning techniques for SPNs have been
proposed and applied to image classification (Gens & Domingos, 2012).
Later, automatic structure learning algorithms were developed to build
tree-structured SPNs directly from data (Dennis & Ventura, 2012; Peharz et
al., 2013; Gens & Domingos, 2013; Rooshenas & Lowd, 2014). SPNs have also
been applied to various fields and have generated promising results,
including activity modeling (Amer & Todorovic, 2012), speech modeling
(Peharz et al., 2014) and language modeling (Cheng et al., 2014).
Theoretical work investigating the influence of the depth of SPNs on
expressiveness exists (Delalleau & Bengio, 2011), but is quite limited. As
discussed later, our results reinforce previous theoretical results about
the depth of SPNs and provide further insights about the structure of SPNs
by examining the structure of equivalent BNs.

3. Preliminaries 

We start by introducing the notation used in this paper. We use $1:N$ to
abbreviate the notation $\{1, 2, \ldots, N\}$. We use a capital letter $X$
to denote a random variable and a bold capital letter $\mathbf{X}_{1:N}$
to denote a set of random variables
$\mathbf{X}_{1:N} = \{X_1, \ldots, X_N\}$. Similarly, a lowercase letter
$x$ is used to denote a value taken by $X$ and a bold lowercase letter
$\mathbf{x}_{1:N}$ denotes a joint value taken by the corresponding vector
$\mathbf{X}_{1:N}$ of random variables. We may omit the subscript $1:N$
from $\mathbf{X}_{1:N}$ and $\mathbf{x}_{1:N}$ if it is clear from the
context. For a random variable $X_i$, we use $x_i^j, j \in 1:J$ to
enumerate all the values taken by $X_i$. For simplicity, we use $\Pr(x)$
to mean $\Pr(X = x)$ and $\Pr(\mathbf{x})$ to mean
$\Pr(\mathbf{X} = \mathbf{x})$. We use calligraphic letters to denote
graphs (e.g., $\mathcal{G}$). In particular, BNs, SPNs and ADDs are
denoted respectively by $\mathcal{B}$, $\mathcal{S}$ and $\mathcal{A}$.
For a DAG $\mathcal{G}$ and a node $v$ in $\mathcal{G}$, we use
$\mathcal{G}_v$ to denote the subgraph of $\mathcal{G}$ induced by $v$ and
all its descendants. Let $\mathbf{V}$ be a subset of the nodes of
$\mathcal{G}$, then $\mathcal{G}|_{\mathbf{V}}$ is a subgraph of
$\mathcal{G}$ induced by the node set $\mathbf{V}$. Similarly, we use
$\mathbf{X}|_A$ or $\mathbf{x}|_A$ to denote the restriction of a vector
to a subset $A$. We use node and vertex, arc and edge interchangeably when
we refer to a graph. Other notation will be introduced when needed.

To ensure that the paper is self-contained, we briefly review some
background material about Bayesian Networks, Algebraic Decision Diagrams
and Sum-Product Networks. Readers who are already familiar with those
models can skip the following subsections.

3.1. Bayesian Network 

Consider a problem whose domain is characterized by a set of random
variables $\mathbf{X}_{1:N}$ with finite support. The joint probability
distribution over $\mathbf{X}_{1:N}$ can be characterized by a Bayesian
Network, which is a DAG where nodes represent the random variables and
edges represent probabilistic dependencies among the variables. In a BN,
we also use the terms "node" and "variable" interchangeably. For each
variable in a BN, there is a local conditional probability distribution
(CPD) over the variable given its parents in the BN.

The structure of a BN encodes conditional independencies among the
variables in it. Let $X_1, X_2, \ldots, X_N$ be a topological ordering of
all the nodes in a BN³, and let $\pi_{X_i}$ be the set of parents of node
$X_i$ in the BN. Each variable in a BN is conditionally independent of all
its non-descendants given its parents. Hence, the joint probability
distribution over $\mathbf{X}_{1:N}$ admits the factorization in Eq. 1.

$$\Pr(\mathbf{X}_{1:N}) = \prod_{i=1}^{N} \Pr(X_i \mid \mathbf{X}_{1:i-1}) = \prod_{i=1}^{N} \Pr(X_i \mid \pi_{X_i}) \tag{1}$$

³ A topological ordering of nodes in a DAG is a linear ordering of its
nodes such that each node appears after all its parents in this ordering.

Given the factorization, one can use various inference algorithms to do
probabilistic reasoning in BNs. See Wainwright & Jordan (2008) for a
comprehensive survey.
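To make the factorization in Eq. 1 concrete, the sketch below evaluates the joint distribution of a two-node BN $X_1 \to X_2$ with tabular CPDs. The network, the variable names and all probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical two-node BN X1 -> X2 with tabular CPDs (numbers are made up).
parents = {"X1": [], "X2": ["X1"]}
# cpd[var][tuple of parent values] = Pr(var = TRUE | parents)
cpd = {
    "X1": {(): 0.6},
    "X2": {(True,): 0.9, (False,): 0.2},
}
order = ["X1", "X2"]  # a topological ordering of the nodes

def joint(assignment):
    """Pr(x_1, ..., x_N) as the product of the local CPD entries (Eq. 1)."""
    p = 1.0
    for var in order:
        key = tuple(assignment[pa] for pa in parents[var])
        p_true = cpd[var][key]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

# Summing the joint over all assignments should give 1.
total = sum(joint(dict(zip(order, vals)))
            for vals in product([False, True], repeat=len(order)))
```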

3.2. Algebraic Decision Diagram 

We first give a formal definition of Algebraic Decision Diagrams (ADDs)
for variables with Boolean domains and then extend the definition to
domains corresponding to arbitrary finite sets.

Definition 1 (Algebraic Decision Diagram (Bahar et al., 1997)). An
Algebraic Decision Diagram (ADD) is a graphical representation of a real
function with Boolean input variables: $f : \{0,1\}^N \to \mathbb{R}$,
where the graph is a rooted DAG. There are two kinds of nodes in an ADD.
Terminal nodes, whose out-degree is 0, are associated with real values.
Internal nodes, whose out-degree is 2, are associated with Boolean
variables $X_n, n \in 1:N$. For each internal node $X_n$, the left
out-edge is labeled with $X_n = \text{FALSE}$ and the right out-edge is
labeled with $X_n = \text{TRUE}$.

We extend the original definition of an ADD by allowing it to represent
not only functions of Boolean variables, but also any function of discrete
variables with a finite set as domain. This can be done by allowing each
internal node $X_n$ to have $|\mathcal{X}_n|$ out-edges and labeling each
edge with $x_n^j, j \in 1:|\mathcal{X}_n|$, where $\mathcal{X}_n$ is the
domain of variable $X_n$ and $|\mathcal{X}_n|$ is the number of values
$X_n$ takes. Such an ADD represents a function
$f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_N \to \mathbb{R}$,
where $\times$ means the Cartesian product between two sets. Henceforth,
we will use our extended definition of ADDs throughout the paper.

For our purpose, we will use an ADD as a compact graphical representation
of local CPDs associated with each node in a BN. This is a key insight of
our constructive proof presented later. Compared with a tabular
representation or a decision tree representation of local CPDs, CPDs
represented by ADDs can fully exploit CSI (Boutilier et al., 1996) and
effectively avoid the replication problem (Pagallo, 1989) of the decision
tree representation.

We give an example in Fig. 1 where the tabular representation,
decision-tree representation and ADD representation of a function of 4
Boolean variables are presented. Another advantage of using ADDs to
represent local CPDs is that arithmetic operations such as multiplying
ADDs and summing out a variable from an ADD can be implemented efficiently
in polynomial time. This will allow us to use ADDs in the Variable
Elimination (VE) algorithm to recover the original SPN after its
conversion to a BN with CPDs represented by ADDs. Readers are referred to
Bahar et al. (1997) for more detailed and thorough discussions about ADDs.
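The effect of sharing isomorphic subgraphs can be illustrated with a small sketch of a reduced ADD over Boolean variables. This is an assumption-laden toy, not the construction used by ADD packages: the names `make_add`, `evaluate` and `cpd_value` are hypothetical, the example function is made up, and the diagram is built by enumerating all assignments, which is exponential in the number of variables.

```python
from itertools import product

def make_add(f, variables):
    """Build a reduced ADD for f by Shannon expansion with node sharing.

    Nodes are tuples: ("leaf", value) or (var, low_child, high_child).
    The `unique` table returns the existing node for a repeated key, so
    isomorphic subgraphs are stored only once.
    """
    unique = {}

    def node(key):
        return unique.setdefault(key, key)

    def build(i, assignment):
        if i == len(variables):
            return node(("leaf", f(assignment)))
        var = variables[i]
        low = build(i + 1, {**assignment, var: False})
        high = build(i + 1, {**assignment, var: True})
        if low is high:  # the test on `var` is redundant in this context
            return low
        return node((var, low, high))

    root = build(0, {})
    return root, len(unique)

def evaluate(node, assignment):
    """Follow one root-to-terminal path and return the terminal value."""
    while node[0] != "leaf":
        var, low, high = node
        node = high if assignment[var] else low
    return node[1]

# Hypothetical function of 4 Boolean variables with context-specific structure.
names = ["X1", "X2", "X3", "X4"]

def cpd_value(s):
    return 0.8 if (s["X1"] and s["X2"]) or (s["X3"] and s["X4"]) else 0.1

root, size = make_add(cpd_value, names)
agree = all(evaluate(root, dict(zip(names, bits))) == cpd_value(dict(zip(names, bits)))
            for bits in product([False, True], repeat=len(names)))
```

For this example the table has 16 entries, while the reduced diagram has 6 nodes (4 internal nodes and 2 terminals), because the sub-diagram testing $X_3$ and $X_4$ is reached from two different contexts but stored once.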

Figure 1. Different representations of the same Boolean function: (a)
tabular representation, (b) decision-tree representation, (c) ADD
representation. The tabular representation cannot exploit CSI and the
decision-tree representation cannot reuse isomorphic subgraphs. The ADD
representation can fully exploit CSI by sharing isomorphic subgraphs,
which makes it the most compact representation among the three
representations. In Fig. 1(b) and Fig. 1(c), the left and right branches
of each internal node correspond respectively to FALSE and TRUE.

3.3. Sum-Product Network

Before introducing SPNs, we first define the notion of network polynomial,
which plays an important role in our proof. We use $\mathbb{1}[X = x]$ to
denote an indicator that returns 1 when $X = x$ and 0 otherwise. To
simplify the notation, we will use $\mathbb{I}_x$ to represent
$\mathbb{1}[X = x]$.

Definition 2 (Network Polynomial (Poon & Domingos, 2011)). Let
$f(\cdot) \geq 0$ be an unnormalized probability distribution over a
Boolean random vector $\mathbf{X}_{1:N}$. The network polynomial of
$f(\cdot)$ is a multilinear function
$\sum_{\mathbf{x}} f(\mathbf{x}) \prod_{n=1}^{N} \mathbb{I}_{x_n}$ of
indicator variables, where the summation is over all possible
instantiations of the Boolean random vector $\mathbf{X}_{1:N}$.

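Intuitively, the network polynomial is a Boolean expansion (Boole, 1847)
of the unnormalized probability distribution $f(\cdot)$. For example, the
network polynomial of a BN $X_1 \to X_2$ is
$\Pr(x_1, x_2)\mathbb{I}_{x_1}\mathbb{I}_{x_2} + \Pr(x_1, \bar{x}_2)\mathbb{I}_{x_1}\mathbb{I}_{\bar{x}_2} + \Pr(\bar{x}_1, x_2)\mathbb{I}_{\bar{x}_1}\mathbb{I}_{x_2} + \Pr(\bar{x}_1, \bar{x}_2)\mathbb{I}_{\bar{x}_1}\mathbb{I}_{\bar{x}_2}$.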
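The role of the indicators in the network polynomial can be seen in a short sketch. The function name `network_polynomial` and the joint table below are hypothetical; the table is just one possible distribution for a BN $X_1 \to X_2$.

```python
from itertools import product

def network_polynomial(f, n):
    """Return the network polynomial of f as a function of indicator values.

    f: maps a tuple of n Booleans to a non-negative number.
    The returned function takes `ind`, a list of (I_x, I_xbar) pairs,
    one pair per variable.
    """
    def poly(ind):
        total = 0.0
        for x in product([True, False], repeat=n):
            term = f(x)
            for (i_pos, i_neg), value in zip(ind, x):
                term *= i_pos if value else i_neg
            total += term
        return total
    return poly

# Hypothetical joint distribution over (X1, X2).
table = {(True, True): 0.54, (True, False): 0.06,
         (False, True): 0.08, (False, False): 0.32}
poly = network_polynomial(lambda x: table[x], 2)

p_x1_x2 = poly([(1, 0), (1, 0)])   # selects the single term Pr(x1, x2)
p_x1 = poly([(1, 0), (1, 1)])      # both X2 indicators set to 1 sums X2 out
p_all = poly([(1, 1), (1, 1)])     # all indicators set to 1 sums everything
```

Setting both indicators of a variable to 1 marginalizes that variable, which is the mechanism SPNs use to answer marginal queries in a single evaluation.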

Definition 3 (Sum-Product Network (Poon & Domingos, 2011)). A Sum-Product
Network (SPN) over Boolean variables $\mathbf{X}_{1:N}$ is a rooted DAG
whose leaves are the indicators $\mathbb{I}_{x_1}, \ldots, \mathbb{I}_{x_N}$
and $\mathbb{I}_{\bar{x}_1}, \ldots, \mathbb{I}_{\bar{x}_N}$ and whose
internal nodes are sums and products. Each edge $(v_i, v_j)$ emanating
from a sum node $v_i$ has a non-negative weight $w_{ij}$. The value of a
product node is the product of the values of its children. The value of a
sum node is $\sum_{v_j \in Ch(v_i)} w_{ij}\, val(v_j)$ where $Ch(v_i)$ are
the children of $v_i$ and $val(v_j)$ is the value of node $v_j$. The value
of an SPN
$\mathcal{S}[\mathbb{I}_{x_1}, \mathbb{I}_{\bar{x}_1}, \ldots, \mathbb{I}_{x_N}, \mathbb{I}_{\bar{x}_N}]$
is the value of its root.

The scope of a node in an SPN is defined as the set of variables that have
indicators among the node's descendants: For any node $v$ in an SPN, if
$v$ is a terminal node, say, an indicator variable over $X$, then
$\text{scope}(v) = \{X\}$, else
$\text{scope}(v) = \bigcup_{\tilde{v} \in Ch(v)} \text{scope}(\tilde{v})$.
Poon & Domingos (2011) further define the following properties of an SPN:

Definition 4 (Complete). An SPN is complete iff each sum node has children
with the same scope.

Definition 5 (Consistent). An SPN is consistent iff no variable appears
negated in one child of a product node and non-negated in another.

Definition 6 (Decomposable). An SPN is decomposable iff for every product
node $v$, $\text{scope}(v_i) \cap \text{scope}(v_j) = \emptyset$ where
$v_i, v_j \in Ch(v), i \neq j$.
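The scope, completeness and decomposability definitions translate directly into recursive checks. The sketch below uses a hypothetical tuple encoding of SPN nodes and recomputes scopes without caching, so it is meant for small examples only; consistency (Definition 5) is not checked here.

```python
# Hypothetical node encoding:
#   ("leaf", var, positive)          indicator I_x (positive) or I_xbar
#   ("sum", [(weight, child), ...])
#   ("prod", [child, ...])

def children(node):
    if node[0] == "leaf":
        return []
    return [c for _, c in node[1]] if node[0] == "sum" else list(node[1])

def scope(node):
    """Set of variables that have indicators among the node's descendants."""
    if node[0] == "leaf":
        return frozenset([node[1]])
    return frozenset().union(*(scope(c) for c in children(node)))

def is_complete(node):
    """Every sum node has children with identical scopes."""
    if node[0] == "sum" and len({scope(c) for c in children(node)}) > 1:
        return False
    return all(is_complete(c) for c in children(node))

def is_decomposable(node):
    """Children of every product node have pairwise disjoint scopes."""
    if node[0] == "prod":
        seen = set()
        for c in children(node):
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return all(is_decomposable(c) for c in children(node))

x1, nx1 = ("leaf", "X1", True), ("leaf", "X1", False)
x2, nx2 = ("leaf", "X2", True), ("leaf", "X2", False)
s1 = ("sum", [(0.6, x1), (0.4, nx1)])
s2 = ("sum", [(0.3, x2), (0.7, nx2)])
good = ("prod", [s1, s2])
overlapping = ("prod", [s1, x1])              # both children mention X1
incomplete = ("sum", [(0.5, x1), (0.5, x2)])  # children have different scopes
```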

Clearly, decomposability implies consistency in SPNs. An SPN is said to be
valid iff it defines an (unnormalized) probability distribution. Poon &
Domingos (2011) proved that if an SPN is complete and consistent, then it
is valid. Note that this is a sufficient, but not necessary condition. In
this paper, we focus only on complete and consistent SPNs as we are
interested in their associated probabilistic semantics. For a complete and
consistent SPN $\mathcal{S}$, each node $v$ in $\mathcal{S}$ defines a
network polynomial $f_v(\cdot)$ which corresponds to the sub-SPN rooted at
$v$. The network polynomial defined by the root of the SPN can then be
computed recursively by taking a weighted sum of the network polynomials
defined by the sub-SPNs rooted at the children of each sum node and a
product of the network polynomials defined by the sub-SPNs rooted at the
children of each product node. The probability distribution induced by an
SPN $\mathcal{S}$ is defined as
$\Pr_{\mathcal{S}}(\mathbf{x}) \triangleq \frac{f_{\mathcal{S}}(\mathbf{x})}{\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x})}$,
where $f_{\mathcal{S}}(\cdot)$ is the network polynomial defined by the
root of the SPN $\mathcal{S}$. An example of a complete and consistent SPN
is given in Fig. 2.
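The recursive evaluation described above can be sketched on the SPN of Fig. 2. The node encoding and function names are hypothetical, and the structure and weights are read off the network polynomial stated in the caption of Fig. 2 rather than from the figure itself.

```python
def value(node, ind):
    """Evaluate an SPN node bottom-up for the indicator assignment `ind`.

    ind maps (variable, polarity) to 0 or 1, e.g. ind[("X1", True)] is I_x1.
    """
    kind = node[0]
    if kind == "leaf":
        return ind[(node[1], node[2])]
    if kind == "sum":
        return sum(w * value(child, ind) for w, child in node[1])
    result = 1
    for child in node[1]:
        result *= value(child, ind)
    return result

def indicators(x1, x2):
    """Indicator values for the complete assignment X1 = x1, X2 = x2."""
    return {("X1", True): int(x1), ("X1", False): int(not x1),
            ("X2", True): int(x2), ("X2", False): int(not x2)}

x1, nx1 = ("leaf", "X1", True), ("leaf", "X1", False)
x2, nx2 = ("leaf", "X2", True), ("leaf", "X2", False)
sa = ("sum", [(6, x1), (4, nx1)])
sb = ("sum", [(9, x1), (1, nx1)])
sc = ("sum", [(6, x2), (14, nx2)])
sd = ("sum", [(2, x2), (8, nx2)])
root = ("sum", [(10, ("prod", [sa, sc])),
                (6, ("prod", [sa, sd])),
                (9, ("prod", [sb, sd]))])

# Setting every indicator to 1 yields the normalization constant sum_x f_S(x).
all_ones = {key: 1 for key in indicators(True, True)}
z = value(root, all_ones)
pr = {(u, v): value(root, indicators(u, v)) / z
      for u in (True, False) for v in (True, False)}
```

A single bottom-up pass with all indicators set to 1 gives the normalization constant, so both the unnormalized value and the induced probability of an assignment cost time linear in the size of the SPN.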

Figure 2. A complete and consistent SPN over Boolean variables $X_1, X_2$.
This SPN is also decomposable since every product node has children whose
scopes do not intersect. The network polynomial defined by (the root of)
this SPN is:
$f(X_1, X_2) = 10(6\mathbb{I}_{x_1} + 4\mathbb{I}_{\bar{x}_1})(6\mathbb{I}_{x_2} + 14\mathbb{I}_{\bar{x}_2}) + 6(6\mathbb{I}_{x_1} + 4\mathbb{I}_{\bar{x}_1})(2\mathbb{I}_{x_2} + 8\mathbb{I}_{\bar{x}_2}) + 9(9\mathbb{I}_{x_1} + \mathbb{I}_{\bar{x}_1})(2\mathbb{I}_{x_2} + 8\mathbb{I}_{\bar{x}_2}) = 594\,\mathbb{I}_{x_1}\mathbb{I}_{x_2} + 1776\,\mathbb{I}_{x_1}\mathbb{I}_{\bar{x}_2} + 306\,\mathbb{I}_{\bar{x}_1}\mathbb{I}_{x_2} + 824\,\mathbb{I}_{\bar{x}_1}\mathbb{I}_{\bar{x}_2}$
and the probability distribution induced by $\mathcal{S}$ is
$\Pr_{\mathcal{S}} = \frac{594}{3500}\mathbb{I}_{x_1}\mathbb{I}_{x_2} + \frac{1776}{3500}\mathbb{I}_{x_1}\mathbb{I}_{\bar{x}_2} + \frac{306}{3500}\mathbb{I}_{\bar{x}_1}\mathbb{I}_{x_2} + \frac{824}{3500}\mathbb{I}_{\bar{x}_1}\mathbb{I}_{\bar{x}_2}$.

4. Main Results

In this section, we first state the main results obtained in this paper
and then provide detailed proofs with some discussion of the results. To
keep the presentation simple, we assume without loss of generality that
all the random variables are Boolean unless explicitly stated. It is
straightforward to extend our analysis to discrete random variables with
finite support. For an SPN $\mathcal{S}$, let $|\mathcal{S}|$ be the size
of the SPN, i.e., the number of nodes plus the number of edges in the
graph. For a BN $\mathcal{B}$, the size of $\mathcal{B}$, $|\mathcal{B}|$,
is defined by the size of the graph plus the size of all the CPDs in
$\mathcal{B}$ (the size of a CPD depends on its representation, which will
be clear from the context). The main theorems are:

Theorem 1. There exists an algorithm that converts any complete and
decomposable SPN $\mathcal{S}$ over Boolean variables $\mathbf{X}_{1:N}$
into a BN $\mathcal{B}$ with CPDs represented by ADDs in time
$O(N|\mathcal{S}|)$. Furthermore, $\mathcal{S}$ and $\mathcal{B}$
represent the same distribution and $|\mathcal{B}| = O(N|\mathcal{S}|)$.


Remark 3. The combination of Theorems 1 and 3 shows that for distributions
for which SPNs allow a compact representation and efficient inference, BNs
with ADDs also allow a compact representation and efficient inference
(i.e., no exponential blow up).

To make the upcoming proofs concise, we first define a normal form for
SPNs and show that every complete and consistent SPN can be transformed
into a normal SPN in quadratic time and space without changing the network
polynomial. We then derive the proofs with normal SPNs. Note that we only
focus on SPNs that are complete and consistent. Hence, when we refer to an
SPN, we assume that it is complete and consistent without explicitly
stating this.

4.1. Normal Form 

For an SPN $\mathcal{S}$, let $f_{\mathcal{S}}(\cdot)$ be the network
polynomial defined at the root of $\mathcal{S}$. Define the height of an
SPN to be the length of the longest path from the root to a terminal node.

Definition 7. An SPN is said to be normal if

1. It is complete and decomposable.

2. For each sum node in the SPN, the weights of the edges emanating from
the sum node are nonnegative and sum to 1.

3. Every terminal node in the SPN is a univariate distribution over a
Boolean variable and the size of the scope of a sum node is at least 2
(sum nodes whose scope is of size 1 are reduced into terminal nodes).

Theorem 4. For any complete and consistent SPN $\mathcal{S}$, there exists
a normal SPN $\mathcal{S}'$ such that
$\Pr_{\mathcal{S}}(\cdot) = \Pr_{\mathcal{S}'}(\cdot)$ and
$|\mathcal{S}'| = O(|\mathcal{S}|^2)$.


As will be clear later, Thm. 1 immediately leads to the following
corollary:

Corollary 2. There exists an algorithm that converts any complete and
consistent SPN $\mathcal{S}$ over Boolean variables $\mathbf{X}_{1:N}$
into a BN $\mathcal{B}$ with CPDs represented by ADDs in time
$O(N|\mathcal{S}|^2)$. Furthermore, $\mathcal{S}$ and $\mathcal{B}$
represent the same distribution and $|\mathcal{B}| = O(N|\mathcal{S}|^2)$.

Remark 1. The BN $\mathcal{B}$ generated from $\mathcal{S}$ in Theorem 1
and Corollary 2 has a simple bipartite DAG structure, where all the source
nodes are hidden variables and the terminal nodes are the Boolean
variables $\mathbf{X}_{1:N}$.

Remark 2. Assuming sum nodes alternate with product nodes in SPN
$\mathcal{S}$, the depth of $\mathcal{S}$ is proportional to the maximum
in-degree of the nodes in $\mathcal{B}$, which, as a result, is
proportional to a lower bound of the tree-width of $\mathcal{B}$.

Theorem 3. Given the BN $\mathcal{B}$ with ADD representation of CPDs
generated from a complete and decomposable SPN $\mathcal{S}$ over Boolean
variables $\mathbf{X}_{1:N}$, the original SPN $\mathcal{S}$ can be
recovered by applying the Variable Elimination algorithm to $\mathcal{B}$
in $O(N|\mathcal{S}|)$.


To show Theorem 4, we first prove the following lemmas.

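Lemma 5. For any complete and consistent SPN $\mathcal{S}$ over
$\mathbf{X}_{1:N}$, there exists a complete and decomposable SPN
$\mathcal{S}'$ over $\mathbf{X}_{1:N}$ such that
$f_{\mathcal{S}}(\mathbf{x}) = f_{\mathcal{S}'}(\mathbf{x}), \forall \mathbf{x}$
and $|\mathcal{S}'| = O(|\mathcal{S}|^2)$.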

Proof. Let $\mathcal{S}$ be a complete and consistent SPN. If it is also
decomposable, then simply set $\mathcal{S}' = \mathcal{S}$ and we are
done. Otherwise, let $v_1, \ldots, v_M$ be an inverse topological ordering
of all the nodes in $\mathcal{S}$, including both terminal nodes and
internal nodes, such that for any $v_m, m \in 1:M$, all the ancestors of
$v_m$ in the graph appear after $v_m$ in the ordering. Let $v_m$ be the
first product node in the ordering that violates decomposability. Let
$v_{m_1}, v_{m_2}, \ldots, v_{m_l}$ be the children of $v_m$ where
$m_1 < m_2 < \cdots < m_l < m$ (due to the inverse topological ordering).
Let $(v_{m_i}, v_{m_j}), i < j, i, j \in 1:l$ be the first ordered pair of
nodes such that
$\text{scope}(v_{m_i}) \cap \text{scope}(v_{m_j}) \neq \emptyset$. Hence,
let $X \in \text{scope}(v_{m_i}) \cap \text{scope}(v_{m_j})$. Consider
$f_{v_{m_i}}$ and $f_{v_{m_j}}$ which are the network polynomials defined
by the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$.

Expand network polynomials $f_{v_{m_i}}$ and $f_{v_{m_j}}$ into a
sum-of-product form by applying the distributive law between products and
sums. For example, if
$f(X_1, X_2) = (\mathbb{I}_{x_1} + 9\mathbb{I}_{\bar{x}_1})(4\mathbb{I}_{x_2} + 6\mathbb{I}_{\bar{x}_2})$,
then the expansion of $f$ is
$f(X_1, X_2) = 4\mathbb{I}_{x_1}\mathbb{I}_{x_2} + 6\mathbb{I}_{x_1}\mathbb{I}_{\bar{x}_2} + 36\mathbb{I}_{\bar{x}_1}\mathbb{I}_{x_2} + 54\mathbb{I}_{\bar{x}_1}\mathbb{I}_{\bar{x}_2}$.
Since $\mathcal{S}$ is complete, the sub-SPNs rooted at $v_{m_i}$ and
$v_{m_j}$ are also complete, which means that each monomial in the
expansion of $f_{v_{m_i}}$ must share the same scope. The same applies to
$f_{v_{m_j}}$. Since
$X \in \text{scope}(v_{m_i}) \cap \text{scope}(v_{m_j})$, every monomial
in the expansion of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ must contain an
indicator variable over $X$, either $\mathbb{I}_x$ or
$\mathbb{I}_{\bar{x}}$. Furthermore, since $\mathcal{S}$ is consistent,
the sub-SPN rooted at $v_m$ is also consistent. Consider
$f_{v_m} = \prod_{k=1}^{l} f_{v_{m_k}} = f_{v_{m_i}} f_{v_{m_j}} \prod_{k \neq i,j} f_{v_{m_k}}$.
Because $v_m$ is consistent, we know that each monomial in the expansions
of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ must contain the same indicator
variable of $X$, either $\mathbb{I}_x$ or $\mathbb{I}_{\bar{x}}$,
otherwise there will be a term $\mathbb{I}_x\mathbb{I}_{\bar{x}}$ in
$f_{v_m}$ which violates the consistency assumption. Without loss of
generality, assume each monomial in the expansions of $f_{v_{m_i}}$ and
$f_{v_{m_j}}$ contains $\mathbb{I}_x$. Then we can re-factorize $f_{v_m}$
in the following way:


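$$f_{v_m} = \prod_{k=1}^{l} f_{v_{m_k}} = \mathbb{I}_x^2 \, \frac{f_{v_{m_i}}}{\mathbb{I}_x} \, \frac{f_{v_{m_j}}}{\mathbb{I}_x} \prod_{k \neq i,j} f_{v_{m_k}} = \mathbb{I}_x \, \tilde{f}_{v_{m_i}} \tilde{f}_{v_{m_j}} \prod_{k \neq i,j} f_{v_{m_k}} \tag{2}$$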

where we use the fact that indicator variables are idempotent, i.e.,
$\mathbb{I}_x^2 = \mathbb{I}_x$, and $\tilde{f}_{v_{m_i}}$
($\tilde{f}_{v_{m_j}}$) is defined as the function obtained by factorizing
$\mathbb{I}_x$ out from $f_{v_{m_i}}$ ($f_{v_{m_j}}$). Eq. 2 means that in
order to make $v_m$ decomposable, we can simply remove all the indicator
variables $\mathbb{I}_x$ from the sub-SPNs rooted at $v_{m_i}$ and
$v_{m_j}$ and later link $\mathbb{I}_x$ to $v_m$ directly. Such a
transformation will not change the network polynomial $f_{v_m}$ as shown
by Eq. 2, but it will remove $X$ from
$\text{scope}(v_{m_i}) \cap \text{scope}(v_{m_j})$. In principle, we can
apply this transformation to all ordered pairs
$(v_{m_i}, v_{m_j}), i < j, i, j \in 1:l$ with nonempty intersections of
scope. However, this is not algorithmically efficient and more
importantly, for local components containing $\mathbb{I}_x$ in $f_{v_m}$
which are reused by other nodes $v_n$ outside of $\mathcal{S}_{v_m}$, we
cannot remove $\mathbb{I}_x$ from them, otherwise the network polynomials
for each such $v_n$ will be changed due to the removal. In such a case, we
need to duplicate the local components to ensure that local
transformations with respect to $f_{v_m}$ do not affect network
polynomials $f_{v_n}$. We present the transformation in Alg. 1. Alg. 1
transforms a complete and consistent SPN $\mathcal{S}$ into a complete and
decomposable SPN $\mathcal{S}'$.

Algorithm 1 Decomposition Transformation
Input: Complete and consistent SPN $\mathcal{S}$.
Output: Complete and decomposable SPN $\mathcal{S}'$.
 1: Let $v_1, v_2, \ldots, v_M$ be an inverse topological ordering of nodes in $\mathcal{S}$.
 2: for $m = 1$ to $M$ do
 3:   if $v_m$ is a non-decomposable product node then
 4:     $\Omega(v_m) \leftarrow \bigcup_{i \neq j} \text{scope}(v_{m_i}) \cap \text{scope}(v_{m_j})$
 5:     $\mathbf{V} \leftarrow \{v \in \mathcal{S}_{v_m} \mid \text{scope}(v) \cap \Omega(v_m) \neq \emptyset\}$
 6:     $\mathcal{S}_{\mathbf{V}} \leftarrow \mathcal{S}_{v_m}|_{\mathbf{V}}$
 7:     $D(v_m) \leftarrow$ descendants of $v_m$
 8:     for node $v \in \mathcal{S}_{\mathbf{V}} \setminus \{v_m\}$ do
 9:       if $Pa(v) \setminus D(v_m) \neq \emptyset$ then
10:         Create $p \leftarrow v \otimes \prod_{X \in \Omega(v_m) \cap \text{scope}(v)} \mathbb{I}_{x^*}$
11:         Connect $p$ to $\forall f \in Pa(v) \setminus D(v_m)$
12:         Disconnect $v$ from $\forall f \in Pa(v) \setminus D(v_m)$
13:       end if
14:     end for
15:     for node $v \in \mathcal{S}_{\mathbf{V}}$ in bottom-up order do
16:       Disconnect $\tilde{v} \in Ch(v)$ $\forall\, \text{scope}(\tilde{v}) \subseteq \Omega(v_m)$
17:     end for
18:     Connect $\prod_{X \in \Omega(v_m)} \mathbb{I}_{x^*}$ to $v_m$ directly
19:   end if
20: end for
21: Delete all nodes unreachable from the root of $\mathcal{S}$
22: Delete all product nodes with out-degree 0
23: Contract all product nodes with out-degree 1

Informally, Alg. 1 works using the following identity:




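$$f_{v_m} = \Bigg( \prod_{X \in \Omega(v_m)} \mathbb{I}_{x^*} \Bigg) \prod_{k=1}^{l} \frac{f_{v_{m_k}}}{\prod_{X \in \Omega(v_m) \cap \text{scope}(v_{m_k})} \mathbb{I}_{x^*}} \tag{3}$$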


where
$\Omega(v_m) \triangleq \bigcup_{i,j \in 1:l,\, i \neq j} \text{scope}(v_{m_i}) \cap \text{scope}(v_{m_j})$,
i.e., $\Omega(v_m)$ is the union of all the shared variables between pairs
of children of $v_m$ and $\mathbb{I}_{x^*}$ is the indicator variable of
$X \in \Omega(v_m)$ appearing in $\mathcal{S}_{v_m}$. Based on the
analysis above, we know that for each $X \in \Omega(v_m)$ there will be
only one kind of indicator variable $\mathbb{I}_{x^*}$ that appears inside
$\mathcal{S}_{v_m}$, otherwise $v_m$ is not consistent. In Line 6,
$\mathcal{S}_{v_m}|_{\mathbf{V}}$ is defined as the sub-SPN of
$\mathcal{S}_{v_m}$ induced by the node set $\mathbf{V}$, i.e., a subgraph
of $\mathcal{S}_{v_m}$ where the node set is restricted to $\mathbf{V}$.
In Lines 5-6, we first extract the induced sub-SPN
$\mathcal{S}_{\mathbf{V}}$ from $\mathcal{S}_{v_m}$ rooted at $v_m$ using
the node set in which nodes have nonempty intersections with
$\Omega(v_m)$. We disconnect the nodes in $\mathcal{S}_{\mathbf{V}}$ from
their children if their children are indicator variables of a subset of
$\Omega(v_m)$ (Lines 15-17). At Line 18, we build a new product node by
multiplying all the indicator variables in $\Omega(v_m)$ and link it to
$v_m$ directly. To keep unchanged the network polynomials of nodes outside
$\mathcal{S}_{v_m}$ that use nodes in $\mathcal{S}_{\mathbf{V}}$, we
create a duplicate node $p$ for each such node $v$ and link $p$ to all the
parents of $v$ outside of $\mathcal{S}_{v_m}$ and at the same time delete
the original link (Lines 9-13).
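The algebraic fact behind Eq. 2 and Eq. 3 can be checked numerically on a toy product node. The sketch below is hypothetical: the two children and their weights are made up, and it verifies only the identity (a shared indicator can be pulled out of both children and attached to the product once), not the node duplication of Lines 9-13.

```python
from itertools import product

def before(ix, iy, iny, iz, inz):
    """Non-decomposable product node: both children contain I_x."""
    child_1 = ix * (2 * iy + 3 * iny)
    child_2 = ix * (5 * iz + 1 * inz)
    return child_1 * child_2

def after(ix, iy, iny, iz, inz):
    """I_x removed from both children and linked to the product node once."""
    return ix * (2 * iy + 3 * iny) * (5 * iz + 1 * inz)

# Indicators only take the values 0 and 1, so I_x * I_x == I_x and the two
# forms agree on every indicator assignment.
same = all(before(*bits) == after(*bits)
           for bits in product([0, 1], repeat=5))
```

After the rewrite the children have scopes $\{Y\}$ and $\{Z\}$, which are disjoint from each other and from the directly attached indicator over $X$, so the product node is decomposable.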

In summary, Lines 15-17 ensure that $v_m$ is decomposable by removing all
the shared indicator variables in $\Omega(v_m)$. Line 18 together with Eq.
3 guarantees that $f_{v_m}$ is unchanged after the transformation. Lines
9-13 create necessary duplicates to ensure that other network polynomials
are not affected. Lines 21-23 simplify the transformed SPN to make it more
compact. An example is depicted in Fig. 3 to illustrate the transformation
process.




Figure 3. Transformation process described in Alg. 1 to construct a
complete and decomposable SPN from a complete and consistent SPN. The
product node $v_m$ in the left SPN is not decomposable. Induced sub-SPN
$\mathcal{S}_{v_m}$ is highlighted in blue and $\mathcal{S}_{\mathbf{V}}$
is highlighted in green. $v_{m_2}$, highlighted in red, is reused by a
node $v_n$ which is outside $\mathcal{S}_{v_m}$. To compensate for
$v_{m_2}$, we create a new product node $p$ in the right SPN and connect
it to the shared indicator variable and $v_{m_2}$. Dashed gray lines in
the right SPN denote deleted edges and nodes while red edges and nodes are
added during Alg. 1.

We now analyze the size of the SPN constructed by Alg. 1. For a graph
$\mathcal{S}$, let $\mathfrak{V}(\mathcal{S})$ be the number of nodes in
$\mathcal{S}$ and let $\mathfrak{E}(\mathcal{S})$ be the number of edges
in $\mathcal{S}$. Note that in Lines 8-17 we only focus on nodes that
appear in the induced SPN $\mathcal{S}_{\mathbf{V}}$, which clearly has
$|\mathcal{S}_{\mathbf{V}}| \leq |\mathcal{S}_{v_m}|$. Furthermore, we
create a new product node $p$ at Line 10 iff $v$ is reused by other nodes
which do not appear in $\mathcal{S}_{v_m}$. This means that the number of
nodes created during each iteration between Lines 2 and 20 is bounded by
$\mathfrak{V}(\mathcal{S}_{\mathbf{V}}) \leq \mathfrak{V}(\mathcal{S}_{v_m})$.
Line 10 also creates 2 new edges to connect $p$ to $v$ and the indicator
variables. Lines 11 and 12 first connect edges to $p$ and then delete
edges from $v$, hence these two steps do not yield increases in the number
of edges. So the increase in the number of edges is bounded by
$2\mathfrak{V}(\mathcal{S}_{\mathbf{V}}) \leq 2\mathfrak{V}(\mathcal{S}_{v_m})$.
Combining increases in both nodes and edges, during each outer iteration
the increase in size is bounded by
$3|\mathcal{S}_{\mathbf{V}}| \leq 3|\mathcal{S}_{v_m}| = O(|\mathcal{S}|)$.
There will be at most $M = \mathfrak{V}(\mathcal{S})$ outer iterations,
hence the total increase in size will be bounded by
$O(M|\mathcal{S}|) = O(|\mathcal{S}|^2)$. □

Lemma 6. For any complete and decomposable SPN $\mathcal{S}$ over
$\mathbf{X}_{1:N}$ that satisfies condition 2 of Def. 7,
$\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x}) = 1$.

Proof. We give a proof by induction on the height of $\mathcal{S}$. Let
$R$ be the root of $\mathcal{S}$.

• Base case. SPNs of height 0 are indicator variables over some Boolean
variable whose network polynomials immediately satisfy Lemma 6.

• Induction step. Assume Lemma 6 holds for any SPN with height $\leq k$.
Consider an SPN $\mathcal{S}$ with height $k + 1$. We consider the
following two cases:

- The root $R$ of $\mathcal{S}$ is a product node. Then in this case the
network polynomial $f_{\mathcal{S}}(\cdot)$ for $\mathcal{S}$ is defined
as $f_{\mathcal{S}} = \prod_{v \in Ch(R)} f_v$. We have

$$\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x}) = \sum_{\mathbf{x}} \prod_{v \in Ch(R)} f_v(\mathbf{x}|_{\text{scope}(v)}) \tag{4}$$

$$= \prod_{v \in Ch(R)} \sum_{\mathbf{x}|_{\text{scope}(v)}} f_v(\mathbf{x}|_{\text{scope}(v)}) \tag{5}$$

$$= \prod_{v \in Ch(R)} 1 = 1 \tag{6}$$

where $\mathbf{x}|_{\text{scope}(v)}$ means that $\mathbf{x}$ is
restricted to the set $\text{scope}(v)$. Eq. 5 follows from the
decomposability of $R$ and Eq. 6 follows from the induction hypothesis.

- The root $R$ of $\mathcal{S}$ is a sum node. The network polynomial is
$f_{\mathcal{S}} = \sum_{v \in Ch(R)} w_{R,v} f_v$. We have

$$\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x}) = \sum_{\mathbf{x}} \sum_{v \in Ch(R)} w_{R,v} f_v(\mathbf{x}) \tag{7}$$

$$= \sum_{v \in Ch(R)} w_{R,v} \sum_{\mathbf{x}} f_v(\mathbf{x}) \tag{8}$$

$$= \sum_{v \in Ch(R)} w_{R,v} = 1 \tag{9}$$

Eq. 8 follows from the commutative and associative law of addition and Eq.
9 follows by the induction hypothesis.

□ 
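Lemma 6 can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical tuple encoding of SPN nodes (`("ind", var, value)`, `("prod", children)`, `("sum", [(weight, child), ...])`) that is not part of the paper; it enumerates all assignments of two Boolean variables and sums the network polynomial of a complete, decomposable SPN with normalized weights.

```python
from itertools import product

# Hypothetical encoding (an assumption for illustration only):
#   ("ind", var, value)              indicator I[var == value]
#   ("prod", [children])             product node
#   ("sum", [(weight, child), ...])  sum node


def evaluate(node, x):
    """Evaluate the network polynomial of `node` at the complete assignment x."""
    kind = node[0]
    if kind == "ind":
        _, var, value = node
        return 1.0 if x[var] == value else 0.0
    if kind == "prod":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, x)
        return result
    # Sum node: weighted sum of the children.
    return sum(w * evaluate(child, x) for w, child in node[1])


def total_mass(root, variables):
    """Sum f_S(x) over every joint assignment of the Boolean variables."""
    total = 0.0
    for values in product([0, 1], repeat=len(variables)):
        total += evaluate(root, dict(zip(variables, values)))
    return total


def bernoulli(var, p):
    """A sum node over the two indicators of `var` with normalized weights."""
    return ("sum", [(p, ("ind", var, 1)), (1.0 - p, ("ind", var, 0))])


spn = ("sum", [
    (0.4, ("prod", [bernoulli("X1", 0.2), bernoulli("X2", 0.9)])),
    (0.6, ("prod", [bernoulli("X1", 0.7), bernoulli("X2", 0.5)])),
])
mass = total_mass(spn, ["X1", "X2"])
```

The enumeration is exponential in the number of variables and is only meant as a sanity check; the lemma itself is what guarantees the result without enumeration.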

Corollary 7. For any complete and decomposable SPN $S$
over $X_{1:N}$ that satisfies condition 2 of Def. 7,
$\Pr_S(\cdot) = f_S(\cdot)$.

Lemma 8. For any complete and decomposable SPN $S$,
there exists an SPN $S'$ where the weights of the edges
emanating from every sum node are nonnegative and sum to
1, and $\Pr_{S'}(\cdot) = \Pr_S(\cdot)$, $|S'| = |S|$.

Algorithm 2 Weight Normalization
Input: SPN $S$
Output: SPN $S'$
1: $S' \leftarrow S$
2: $val(\mathbb{I}_x) \leftarrow 1, \forall \mathbb{I}_x \in S$
3: Let $v_1, \ldots, v_M$ be an inverse topological ordering of the nodes in $S$
4: for $m = 1$ to $M$ do
5:   if $v_m$ is a sum node then
6:     $val(v_m) \leftarrow \sum_{v \in Ch(v_m)} w_{v_m,v}\, val(v)$
7:     $w'_{v_m,v} \leftarrow \frac{w_{v_m,v}\, val(v)}{val(v_m)}, \forall v \in Ch(v_m)$
8:   else if $v_m$ is a product node then
9:     $val(v_m) \leftarrow \prod_{v \in Ch(v_m)} val(v)$
10:  end if
11: end for

Proof. Alg. 2 runs in one pass of $S$ to construct the required
SPN $S'$. We proceed to prove that the SPN $S'$ returned
by Alg. 2 satisfies $\Pr_{S'}(\cdot) = \Pr_S(\cdot)$, $|S'| = |S|$
and that $S'$ satisfies condition 2 of Def. 7. It is clear that
$|S'| = |S|$ because we only modify the weights of $S$ to
construct $S'$ at Line 7. Based on Lines 6 and 7, it is also
straightforward to verify that for each sum node $v$ in $S'$,
the weights of the edges emanating from $v$ are nonnegative
and sum to 1. We now show that $\Pr_{S'}(\cdot) = \Pr_S(\cdot)$.
Using Corollary 7, $\Pr_{S'}(\cdot) = f_{S'}(\cdot)$. Hence it is
sufficient to show that $f_{S'}(\cdot) = \Pr_S(\cdot)$. Before deriving
a proof, it is helpful to note that for each node $v \in S$,
$val(v) = \sum_{\mathbf{x}|_{scope(v)}} f_v(\mathbf{x}|_{scope(v)})$. We give a proof by
induction on the height of $S$.

• Base case. SPNs with height 0 are indicator variables
which automatically satisfy Lemma 8.

• Induction step. Assume Lemma 8 holds for any SPN
of height $\le k$. Consider an SPN $S$ of height $k+1$. Let
$R$ be the root node of $S$ with out-degree $l$. We discuss
the following two cases.

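- $R$ is a product node. Let $R_1, \ldots, R_l$ be the
children of $R$ and $S_1, \ldots, S_l$ be the corresponding
sub-SPNs. By induction, Alg. 2 returns
$S'_1, \ldots, S'_l$ that satisfy Lemma 8. Since $R$ is a
product node, we have

$$f_{S'}(\mathbf{x}) = \prod_{i=1}^{l} f_{S'_i}(\mathbf{x}|_{scope(R_i)}) \tag{10}$$

$$= \prod_{i=1}^{l} \Pr_{S_i}(\mathbf{x}|_{scope(R_i)}) \tag{11}$$

$$= \prod_{i=1}^{l} \frac{f_{S_i}(\mathbf{x}|_{scope(R_i)})}{\sum_{\mathbf{x}|_{scope(R_i)}} f_{S_i}(\mathbf{x}|_{scope(R_i)})} \tag{12}$$

$$= \frac{\prod_{i=1}^{l} f_{S_i}(\mathbf{x}|_{scope(R_i)})}{\sum_{\mathbf{x}} \prod_{i=1}^{l} f_{S_i}(\mathbf{x}|_{scope(R_i)})} \tag{13}$$

$$= \frac{f_S(\mathbf{x})}{\sum_{\mathbf{x}} f_S(\mathbf{x})} = \Pr_S(\mathbf{x}) \tag{14}$$

Eq. 11 follows from the induction hypothesis and
Eq. 13 follows from the distributive law due to
the decomposability of $S$.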


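- $R$ is a sum node with weights $w_1, \ldots, w_l \ge 0$.
We have

$$f_{S'}(\mathbf{x}) = \sum_{i=1}^{l} w'_i f_{S'_i}(\mathbf{x}) \tag{15}$$

$$= \sum_{i=1}^{l} \frac{w_i\, val(R_i)}{\sum_{j=1}^{l} w_j\, val(R_j)} \Pr_{S_i}(\mathbf{x}) \tag{16}$$

$$= \sum_{i=1}^{l} \frac{w_i\, val(R_i)}{\sum_{j=1}^{l} w_j\, val(R_j)} \cdot \frac{f_{S_i}(\mathbf{x})}{\sum_{\mathbf{x}} f_{S_i}(\mathbf{x})} \tag{17}$$

$$= \sum_{i=1}^{l} \frac{w_i\, val(R_i)}{\sum_{j=1}^{l} w_j\, val(R_j)} \cdot \frac{f_{S_i}(\mathbf{x})}{val(R_i)} \tag{18}$$

$$= \frac{\sum_{i=1}^{l} w_i f_{S_i}(\mathbf{x})}{\sum_{j=1}^{l} w_j\, val(R_j)} = \frac{f_S(\mathbf{x})}{\sum_{\mathbf{x}} f_S(\mathbf{x})} \tag{19}$$

$$= \Pr_S(\mathbf{x}) \tag{20}$$

where Eq. 16 follows from the induction hypothesis,
and Eq. 18 and 19 follow from the fact that
$val(v) = \sum_{\mathbf{x}|_{scope(v)}} f_v(\mathbf{x}|_{scope(v)}), \forall v \in S$.

This completes the proof since $\Pr_{S'}(\cdot) = f_{S'}(\cdot) = \Pr_S(\cdot)$.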

□ 
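The weight normalization of Alg. 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Node` class and the helper names are assumptions made for the example, the weights are updated in place rather than on a copy $S'$, and a post-order traversal with a cache stands in for the inverse topological ordering. It assumes every sum node has a strictly positive value.

```python
from itertools import product


class Node:
    """Hypothetical minimal SPN node; kind is 'ind', 'sum' or 'prod'."""

    def __init__(self, kind, children=(), weights=(), var=None, value=None):
        self.kind = kind
        self.children = list(children)
        self.weights = list(weights)  # used by sum nodes only
        self.var, self.value = var, value


def evaluate(node, x):
    """Network polynomial of the sub-SPN rooted at `node`, at assignment x."""
    if node.kind == "ind":
        return 1.0 if x[node.var] == node.value else 0.0
    vals = [evaluate(c, x) for c in node.children]
    if node.kind == "prod":
        out = 1.0
        for v in vals:
            out *= v
        return out
    return sum(w * v for w, v in zip(node.weights, vals))


def normalize_weights(root):
    """Normalize sum-node weights in place, visiting children before parents."""
    val = {}  # id(node) -> val(node), also prevents re-processing shared nodes

    def visit(node):
        if id(node) in val:
            return val[id(node)]
        if node.kind == "ind":
            v = 1.0                                   # Line 2 of Alg. 2
        elif node.kind == "prod":
            v = 1.0
            for child in node.children:               # Line 9
                v *= visit(child)
        else:
            child_vals = [visit(c) for c in node.children]
            v = sum(w * cv for w, cv in zip(node.weights, child_vals))  # Line 6
            node.weights = [w * cv / v                                  # Line 7
                            for w, cv in zip(node.weights, child_vals)]
        val[id(node)] = v
        return v

    visit(root)
    return root


def ind(var, value):
    return Node("ind", var=var, value=value)


# An unnormalized SPN in which s2 is shared by two product nodes.
s1 = Node("sum", [ind("X1", 1), ind("X1", 0)], [2.0, 6.0])
s2 = Node("sum", [ind("X2", 1), ind("X2", 0)], [1.0, 3.0])
s3 = Node("sum", [ind("X1", 1), ind("X1", 0)], [1.0, 1.0])
root = Node("sum", [Node("prod", [s1, s2]), Node("prod", [s3, s2])], [1.0, 4.0])

assignments = [dict(zip(("X1", "X2"), v)) for v in product([0, 1], repeat=2)]
unnormalized = [evaluate(root, x) for x in assignments]
z = sum(unnormalized)
before = [u / z for u in unnormalized]   # Pr_S(x) of the original SPN
normalize_weights(root)
after = [evaluate(root, x) for x in assignments]  # f_{S'}(x) after Alg. 2
```

In this example `before` and `after` agree entry by entry, which is the statement $f_{S'}(\cdot) = \Pr_S(\cdot)$ of Lemma 8 on a concrete instance.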

Given a complete and decomposable SPN $S$, we now show by
construction that the last condition in Def. 7 can be
satisfied in time and space $O(|S|)$.

Lemma 9. Given a complete and decomposable SPN $S$,
there exists an SPN $S'$ satisfying condition 3 in Def. 7 such
that $\Pr_{S'}(\cdot) = \Pr_S(\cdot)$ and $|S'| = O(|S|)$.

Proof. We give a proof by construction. First, if $S$ is not
weight normalized, apply Alg. 2 to normalize the weights
(i.e., the weights of the edges emanating from each sum
node sum to 1).

Now check each sum node $v$ in $S$ in a bottom-up order.
If $|scope(v)| = 1$, by Corollary 7 we know the network
polynomial $f_v$ is a probability distribution over its scope,
say, $\{X\}$. Reduce $v$ into a terminal node which is a
distribution over $X$ induced by its network polynomial and
disconnect $v$ from all its children. The last step is to
remove all the unreachable nodes from $S$ to obtain $S'$. Note
that in this step we will only decrease the size of $S$, hence
$|S'| = O(|S|)$. □

Proof of Thm. 4. The combination of Lemma 5, 8 and 9 
completes the proof of Thm. 4. □ 

An example of a normal SPN constructed from the SPN in
Fig. 2 is depicted in Fig. 4.

Figure 4. Transform an SPN into a normal form. Terminal nodes
which are probability distributions over a single variable are
represented by a double-circle.

4.2. SPN to BN 

In order to construct a BN from an SPN, we require the 
SPN to be in a normal form, otherwise we can first trans¬ 
form it into a normal form using Alg. 1 and 2. 

Let $S$ be a normal SPN over $X_{1:N}$. Before showing how
to construct a corresponding BN, we first give some
intuitions. One useful view is to associate each sum node in
an SPN with a hidden variable. For example, consider a
sum node $v \in S$ with out-degree $l$. Since $S$ is normal, we
have $\sum_{i=1}^{l} w_i = 1$ and $w_i \ge 0, \forall i \in 1:l$. This
naturally suggests that we can associate a hidden discrete
random variable $H_v$ with multinomial distribution
$\Pr_v(H_v = i) = w_i, i \in 1:l$ for each sum node $v \in S$. Therefore,
$S$ can be thought of as defining a joint probability
distribution over $X_{1:N}$ and $\mathbf{H} = \{H_v \mid v \in S, v \text{ is a sum node}\}$,
where $X_{1:N}$ are the observable variables and $\mathbf{H}$ are the
hidden variables. When doing inference with an SPN, we
implicitly sum out all the hidden variables $\mathbf{H}$ and compute
$\Pr_S(\mathbf{x}) = \sum_{\mathbf{h}} \Pr_S(\mathbf{x}, \mathbf{h})$. Associating each sum node in
an SPN with a hidden variable not only gives us a conceptual
understanding of the probability distribution defined
by an SPN, but also helps to elucidate one of the key
properties implied by the structure of an SPN as summarized
below:

Proposition 10. Given a normal SPN $S$, let $p$ be a product
node in $S$ with $l$ children. Let $v_1, \ldots, v_k$ be sum nodes
which lie on a path from the root of $S$ to $p$. Then

$$\Pr_S(\mathbf{x}|_{scope(p)} \mid H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*) = \prod_{i=1}^{l} \Pr_S(\mathbf{x}|_{scope(p_i)} \mid H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*) \tag{21}$$

where $H_v = v^*$ means the sum node $v$ selects its $v^*$th
branch, $\mathbf{x}|_A$ denotes restricting $\mathbf{x}$ by set $A$, and $p_i$ is the $i$th
child of product node $p$.

Proof. Consider the sub-SPN $S_p$ rooted at $p$. $S_p$ can be
obtained by restricting $H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*$, i.e., going
from the root of $S$ along the path $H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*$.
Since $p$ is a decomposable product node, $S_p$ admits the
above factorization by the definition of a product node and
Corollary 7. □

Note that there may exist multiple paths from the root to
$p$ in $S$. Each such path admits the factorization stated in
Eq. 21. Eq. 21 explains two key insights implied by the
structure of an SPN that will allow us to construct an
equivalent BN with ADDs. First, CSI is efficiently encoded by
the structure of an SPN using Proposition 10. Second, the
DAG structure of an SPN allows multiple assignments of
hidden variables to share the same factorization, which
effectively avoids the replication problem present in
decision trees.

Based on the observations above and with the help of the
normal form for SPNs, we now proceed to prove the first
main result in this paper: Thm. 1. First, we present the
algorithm to construct the structure of a BN $B$ from $S$ in
Alg. 3. In a nutshell, Alg. 3 creates an observable variable
$X$ in $B$ for each terminal node over $X$ in $S$ (Lines 2-4). For
each internal sum node $v$ in $S$, Alg. 3 creates a hidden
variable $H_v$ associated with $v$ and builds directed edges from
$H_v$ to all observable variables $X$ appearing in the sub-SPN
rooted at $v$ (Lines 11-17). The BN $B$ created by Alg. 3 has
a directed bipartite structure with a layer of hidden
variables pointing to a layer of observable variables. A hidden
variable $H$ points to an observable variable $X$ in $B$ iff $X$
appears in the sub-SPN rooted at $H$ in $S$.

Algorithm 3 Build BN Structure
Input: normal SPN $S$
Output: BN $B = (B_V, B_E)$
1: $R \leftarrow$ root of $S$
2: if $R$ is a terminal node over variable $X$ then
3:   Create an observable variable $X$
4:   $B_V \leftarrow B_V \cup \{X\}$
5: else
6:   for each child $R_i$ of $R$ do
7:     if BN has not been built for $S_{R_i}$ then
8:       Recursively build BN Structure for $S_{R_i}$
9:     end if
10:  end for
11:  if $R$ is a sum node then
12:    Create a hidden variable $H_R$ associated with $R$
13:    $B_V \leftarrow B_V \cup \{H_R\}$
14:    for each observable variable $X \in S_R$ do
15:      $B_E \leftarrow B_E \cup \{(H_R, X)\}$
16:    end for
17:  end if
18: end if
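The structure-building step can be sketched as a memoized post-order traversal. The following is an illustrative sketch under assumptions, not the paper's code: the `Node` class, the node names, and the returned triple of sets are hypothetical choices made for this example. Each sum node contributes one hidden variable and one directed edge to every observable variable in its scope.

```python
class Node:
    """Hypothetical normal-SPN node; kind is 'terminal', 'sum' or 'prod'."""

    def __init__(self, kind, name, children=(), var=None):
        self.kind = kind
        self.name = name
        self.children = list(children)
        self.var = var  # observable variable of a terminal node


def build_bn_structure(root):
    """Return (observable, hidden, edges) of the bipartite BN for a normal SPN."""
    observable, hidden, edges = set(), set(), set()
    scope = {}  # id(node) -> frozenset of observable variables below the node

    def visit(node):
        if id(node) in scope:          # BN already built for this sub-SPN
            return scope[id(node)]
        if node.kind == "terminal":
            observable.add(node.var)   # Lines 2-4 of Alg. 3
            s = frozenset([node.var])
        else:
            s = frozenset().union(*(visit(c) for c in node.children))
            if node.kind == "sum":     # Lines 11-17 of Alg. 3
                hidden.add(node.name)
                edges.update((node.name, x) for x in s)
        scope[id(node)] = s
        return s

    visit(root)
    return observable, hidden, edges


def terminal(name, var):
    return Node("terminal", name, var=var)


# A normal SPN with a single sum node H over two decomposable products.
p1 = Node("prod", "P1", [terminal("T1", "X1"), terminal("T2", "X2")])
p2 = Node("prod", "P2", [terminal("T3", "X1"), terminal("T4", "X2")])
root = Node("sum", "H", [p1, p2])
observable, hidden, edges = build_bn_structure(root)
```

On this example the result is one hidden variable `H` pointing to both `X1` and `X2`, which is the bipartite shape described above. Product nodes leave no trace in the graph; they only matter for the ADDs.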

We now present Alg. 4 and 5 to build ADDs for each
observable variable $X$ and hidden variable $H$ in $B$. For each
hidden variable $H$, Alg. 5 builds $\mathcal{A}_H$ as a decision stump*
obtained by finding $H$ and its associated weights in $S$.
Consider the ADDs built by Alg. 4 for the observable variables.
Let $X$ be the current observable variable we are
considering. Basically, Alg. 4 is a recursive algorithm applied
to each node in $S$ whose scope intersects with $\{X\}$. There
are three cases. If the current node is a terminal node, then it
must be a probability distribution over $X$. In this case we
simply return the decision stump at the current node. If the
current node is a sum node, then due to the completeness of
$S$, we know that all the children of $R$ share the same scope
with $R$. We first create a node $H_R$ corresponding to the
hidden variable associated with $R$ into $\mathcal{A}_X$ (Line 8) and
recursively apply Alg. 4 to all the children of $R$ and link
them to $H_R$ respectively. If the current node is a product
node, then due to the decomposability of $S$, we know that
there will be a unique child of $R$ whose scope intersects
with $\{X\}$. We recursively apply Alg. 4 to this child and
return the resulting ADD (Lines 12-15).

Algorithm 4 Build CPD using ADD, observable variable
Input: normal SPN $S$, variable $X$
Output: ADD $\mathcal{A}_X$
1: if ADD has already been created for $S$ and $X$ then
2:   $\mathcal{A}_X \leftarrow$ retrieve ADD from cache
3: else
4:   $R \leftarrow$ root of $S$
5:   if $R$ is a terminal node then
6:     $\mathcal{A}_X \leftarrow$ decision stump rooted at $R$
7:   else if $R$ is a sum node then
8:     Create a node $H_R$ into $\mathcal{A}_X$
9:     for each $R_i \in Ch(R)$ do
10:      Link BuildADD($S_{R_i}$, $X$) as $i$th child of $H_R$
11:    end for
12:  else if $R$ is a product node then
13:    Find child $S_{R_i}$ such that $X \in scope(R_i)$
14:    $\mathcal{A}_X \leftarrow$ BuildADD($S_{R_i}$, $X$)
15:  end if
16:  store $\mathcal{A}_X$ in cache
17: end if

Algorithm 5 Build CPD using ADD, hidden variable
Input: normal SPN $S$, variable $H$
Output: ADD $\mathcal{A}_H$
1: Find the sum node $H$ in $S$
2: $\mathcal{A}_H \leftarrow$ decision stump rooted at $H$ in $S$

Equivalently, Alg. 4 can be understood in the following
way: we extract the sub-SPN induced by $\{X\}$ and
contract† all the product nodes in it to obtain $\mathcal{A}_X$. Note that
the contraction of product nodes will not add more edges
into $\mathcal{A}_X$ since the out-degree of each product node in the
induced sub-SPN must be 1 due to the decomposability of
the product node. We illustrate the application of Alg. 3, 4
and 5 on the normal SPN in Fig. 4, which results in the BN
$B$ with CPDs represented by ADDs shown in Fig. 5.

* A decision stump is a decision tree with one variable.

† In graph theory, the contraction of a node $v$ in a DAG is the
operation that connects each parent of $v$ to each child of $v$ and
then deletes $v$ from the graph.
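The recursion of Alg. 4 can be sketched as follows. This is an illustrative sketch under assumptions: the `Node` class, the tuple encoding of ADD nodes (`("decision", name, branches)` and `("stump", var, dist)`), and the `lookup` helper are hypothetical and introduced only for this example. Product nodes are contracted by recursing into the unique child whose scope contains the variable.

```python
class Node:
    """Hypothetical normal-SPN node; kind is 'terminal', 'sum' or 'prod'."""

    def __init__(self, kind, name, children=(), var=None, dist=None):
        self.kind = kind
        self.name = name
        self.children = list(children)
        self.var = var    # variable of a terminal node
        self.dist = dist  # dict value -> probability for a terminal node


def scope(node, memo):
    """Set of observable variables below `node` (memoized by node identity)."""
    if id(node) not in memo:
        if node.kind == "terminal":
            memo[id(node)] = frozenset([node.var])
        else:
            memo[id(node)] = frozenset().union(
                *(scope(c, memo) for c in node.children))
    return memo[id(node)]


def build_add(node, x, cache, memo):
    """ADD for observable variable x; assumes x is in the scope of `node`."""
    key = (id(node), x)
    if key in cache:                       # Lines 1-2 of Alg. 4
        return cache[key]
    if node.kind == "terminal":            # Lines 5-6: decision stump
        add = ("stump", node.var, node.dist)
    elif node.kind == "sum":               # Lines 7-11: one branch per child
        add = ("decision", node.name,
               tuple(build_add(c, x, cache, memo) for c in node.children))
    else:                                  # Lines 12-14: contract the product
        (child,) = [c for c in node.children if x in scope(c, memo)]
        add = build_add(child, x, cache, memo)
    cache[key] = add                       # Line 16
    return add


def lookup(add, h, value):
    """Read Pr(X = value | h) by following the hidden assignment h."""
    while add[0] == "decision":
        _, name, branches = add
        add = branches[h[name]]
    return add[2][value]


t1 = Node("terminal", "T1", var="X1", dist={1: 0.2, 0: 0.8})
t2 = Node("terminal", "T2", var="X2", dist={1: 0.9, 0: 0.1})
t3 = Node("terminal", "T3", var="X1", dist={1: 0.7, 0: 0.3})
t4 = Node("terminal", "T4", var="X2", dist={1: 0.5, 0: 0.5})
root = Node("sum", "H", [Node("prod", "P1", [t1, t2]),
                         Node("prod", "P2", [t3, t4])])

cache, memo = {}, {}
add_x1 = build_add(root, "X1", cache, memo)
add_x2 = build_add(root, "X2", cache, memo)
```

Here `lookup(add_x1, {"H": 0}, 1)` reads the entry of the CPD of `X1` selected by the first branch of `H`. The weights of `H` itself do not appear in `add_x1`; they belong to the decision stump built by Alg. 5.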

We now show that $\Pr_S(\mathbf{x}) = \Pr_B(\mathbf{x})$, $\forall \mathbf{x}$.

Lemma 11. Given a normal SPN $S$, the ADDs constructed
by Alg. 4 and 5 encode local CPDs at each node in $B$.

Proof. It is easy to verify that for each hidden variable $H$
in $B$, $\mathcal{A}_H$ represents a local CPD since $\mathcal{A}_H$ is a decision
stump with normalized weights.

For any observable variable $X$ in $B$, let $Pa(X)$ be the set
of parents of $X$. By Alg. 3, every node in $Pa(X)$ is a
hidden variable. Furthermore, $\forall H$, $H \in Pa(X)$ iff there
exists one terminal node over $X$ in $S$ that appears in the
sub-SPN rooted at $H$. Hence given any joint assignment
$\mathbf{h}$ of $Pa(X)$, there will be a path in $\mathcal{A}_X$ from the root to
a terminal node that is consistent with the joint assignment
of the parents. Also, the leaves in $\mathcal{A}_X$ contain normalized
weights corresponding to the probabilities of $X$ (see Def. 7)
induced by the creation of decision stumps over $X$ in Lines
5-6 of Alg. 4. □

Theorem 12. For any normal SPN $S$ over $X_{1:N}$, the BN $B$
constructed by Alg. 3, 4 and 5 encodes the same probability
distribution, i.e., $\Pr_S(\mathbf{x}) = \Pr_B(\mathbf{x})$, $\forall \mathbf{x}$.

Proof. Again, we give a proof by induction on the height
of $S$.

• Base case. The height of SPN $S$ is 0. In this case,
$S$ will be a single terminal node over $X$ and $B$ will
be a single observable node with decision stump $\mathcal{A}_X$
constructed from the terminal node by Lines 5-6 in
Alg. 4. It is clear that $\Pr_S(x) = \Pr_B(x)$, $\forall x$.

• Induction step. Assume $\Pr_S(\mathbf{x}) = \Pr_B(\mathbf{x})$, $\forall \mathbf{x}$ for
any $S$ with height $\le k$, where $B$ is the corresponding
BN constructed by Alg. 3, 4 and 5 from $S$. Consider
an SPN $S$ with height $k+1$. Let $R$ be the root of $S$ and
$R_i, i \in 1:l$ be the children of $R$ in $S$. We consider
the following two cases:

- $R$ is a product node. Let $scope(R_t) = \mathbf{X}_t$, $t \in 1:l$.
Claim: there is no edge between $S_i$
and $S_j$, $i \ne j$, where $S_i$ ($S_j$) is the sub-SPN
rooted at $R_i$ ($R_j$). If there is an edge, say, from
$v_j$ to $v_i$ where $v_j \in S_j$ and $v_i \in S_i$, then
$scope(v_i) \subseteq scope(v_j) \subseteq scope(R_j)$. On the
other hand, $scope(v_i) \subseteq scope(R_i)$. So we
have $\emptyset \ne scope(v_i) \subseteq scope(R_i) \cap scope(R_j)$,
which contradicts the decomposability of the

then delete v from the graph. 
Figure 5. Construct a BN with CPDs represented by ADDs from an SPN. On the left, the induced sub-SPNs used to create $\mathcal{A}_{X_1}$ and
$\mathcal{A}_{X_2}$ by Alg. 4 are indicated in blue and green respectively. The decision stump used to create $\mathcal{A}_H$ by Alg. 5 is indicated in red.


product node $R$. Hence the constructed BN $B$
will be a forest of $l$ disconnected components,
and each component $B_t$ will correspond to the
sub-SPN $S_t$ rooted at $R_t$, $\forall t \in 1:l$, with
height $\le k$. By the induction hypothesis we have
$\Pr_{B_t}(\mathbf{x}_t) = \Pr_{S_t}(\mathbf{x}_t)$, $\forall t \in 1:l$. Considering the
whole BN $B$, we have:

$$\Pr_B(\mathbf{x}) = \prod_{t=1}^{l} \Pr_{B_t}(\mathbf{x}_t) = \prod_{t=1}^{l} \Pr_{S_t}(\mathbf{x}_t) = \Pr_S(\mathbf{x}) \tag{22}$$

where the first equation is due to the d-separation
rule in BNs by noting that each component $B_t$
is disconnected from all other components. The
second equation follows from the induction
hypothesis. The last equation follows from the
definition of a product node.

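- $R$ is a sum node. In this case, due to the
completeness of $S$, all the children of $R$ share the
same scope as $R$. By the construction process
presented in Alg. 3, 4 and 5, there is a hidden
variable $H$ corresponding to $R$ that takes $l$ different
values in $B$. Let $w_{1:l}$ be the weights of
the edges emanating from $R$ in $S$. For the $t$th
branch of $R$, we use $\mathbf{H}_t$ to denote the set of hidden
variables in $B$ that also appear in $B_t$, and let
$\mathbf{H}_{-t} = \mathbf{H} \setminus \mathbf{H}_t$, where $\mathbf{H}$ is the set of all hidden
variables in $B$ except $H$. First, we show the
following identity:

$$\Pr_B(\mathbf{x} \mid H = h_t) = \sum_{\mathbf{h}_t} \sum_{\mathbf{h}_{-t}} \Pr_B(\mathbf{x}, \mathbf{h}_t, \mathbf{h}_{-t} \mid H = h_t) \tag{23}$$

$$= \sum_{\mathbf{h}_t} \sum_{\mathbf{h}_{-t}} \Pr_B(\mathbf{x}, \mathbf{h}_t \mid H = h_t, \mathbf{h}_{-t}) \Pr_B(\mathbf{h}_{-t} \mid H = h_t) \tag{24}$$

$$= \sum_{\mathbf{h}_t} \sum_{\mathbf{h}_{-t}} \Pr_B(\mathbf{x}, \mathbf{h}_t \mid H = h_t) \Pr_B(\mathbf{h}_{-t} \mid H = h_t) \tag{25}$$

$$= \sum_{\mathbf{h}_t} \Pr_B(\mathbf{x}, \mathbf{h}_t \mid H = h_t) \sum_{\mathbf{h}_{-t}} \Pr_B(\mathbf{h}_{-t} \mid H = h_t) \tag{26}$$

$$= \sum_{\mathbf{h}_t} \Pr_B(\mathbf{x}, \mathbf{h}_t \mid H = h_t) \tag{27}$$

$$= \sum_{\mathbf{h}_t} \Pr_{B_t}(\mathbf{x}, \mathbf{h}_t) = \Pr_{B_t}(\mathbf{x}) \tag{28}$$

Using this identity, we have

$$\Pr_B(\mathbf{x}) = \sum_{t=1}^{l} \Pr_B(h_t) \Pr_B(\mathbf{x} \mid H = h_t) \tag{29}$$

$$= \sum_{t=1}^{l} w_t \Pr_{B_t}(\mathbf{x}) \tag{30}$$

$$= \sum_{t=1}^{l} w_t \Pr_{S_t}(\mathbf{x}) \tag{31}$$

$$= \Pr_S(\mathbf{x}) \tag{32}$$

Eq. 25 follows from the fact that $\mathbf{X}$ and $\mathbf{H}_t$ are
independent of $\mathbf{H}_{-t}$ given $H = h_t$, i.e., we take
advantage of the CSI described by the ADDs of $\mathbf{X}$.
Eq. 26 follows from the fact that $\mathbf{H}_{-t}$ appears
only in the second term, and Eq. 27 holds because
that inner sum equals 1. Combined with the fact
that $H = h_t$ is given as evidence in $B$, this gives
us the induced subgraph $B_t$ referred to in Eq. 28.
Eq. 30 follows from Eq. 28 and Eq. 31 follows
from the induction hypothesis.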

Combining the base case and the induction step completes
the proof for Thm. 12. □

We now bound the size of B: 

Theorem 13. $|B| = O(N|S|)$, where BN $B$ is constructed
by Alg. 3, 4 and 5 from normal SPN $S$ over $X_{1:N}$.

Proof. For each observable variable $X$ in $B$, $\mathcal{A}_X$ is
constructed by first extracting from $S$ the induced sub-SPN $S_X$
that contains all nodes whose scope includes $X$ and then
contracting all the product nodes in $S_X$ to obtain $\mathcal{A}_X$. By
the decomposability of product nodes, each product node
in $S_X$ has out-degree 1, otherwise the original SPN $S$
violates the decomposability property. Since contracting
product nodes does not increase the number of edges in $S_X$, we
have $|\mathcal{A}_X| \le |S_X| \le |S|$.

For each hidden variable $H$ in $B$, $\mathcal{A}_H$ is a decision stump
constructed from the internal sum node corresponding to
$H$ in $S$. Hence, we have $\sum_H |\mathcal{A}_H| \le |S|$.

Now consider the size of the graph $B$. Note that only
terminal nodes and sum nodes will have corresponding
variables in $B$. It is clear that the number of nodes in $B$ is
bounded by the number of nodes in $S$. Furthermore, a hidden
variable $H$ points to an observable variable $X$ in $B$ iff
$X$ appears in the sub-SPN rooted at $H$ in $S$, i.e., there is
a path from the sum node corresponding to $H$ to one of
the terminal nodes over $X$. For a sum node $H$ (which
corresponds to a hidden variable $H \in B$) with scope size $s$, each
edge emanating from $H$ in $S$ will correspond to directed
edges in $B$ at most $s$ times, since there are exactly $s$
observable variables which are children of $H$ in $B$. It is clear
that $s \le N$, so each edge emanating from a sum node in $S$
will be counted at most $N$ times in $B$. Edges from product
nodes will not occur in the graph of $B$; instead, they
have been counted in the ADD representations of the local
CPDs in $B$. So again, the size of the graph $B$ is bounded by
$\sum_H |scope(H)| \times deg(H) \le \sum_H N\, deg(H) \le 2N|S|$.

There are $N$ observable variables in $B$. So the total size
of $B$, including the size of the graph and the size of all the
ADDs, is bounded by $N|S| + |S| + 2N|S| = O(N|S|)$. □

We give the time complexity of Alg. 3, 4 and 5. 

Theorem 14. For any normal SPN $S$ over $X_{1:N}$, Alg. 3, 4
and 5 construct an equivalent BN in time $O(N|S|)$.


Proof. First consider Alg. 3. Alg. 3 recursively visits each
node and its children in $S$ if they have not been visited
(Lines 6-10). For each node $v$ in $S$, Lines 7-9 cost at
most $2 \cdot \text{out-degree}(v)$. If $v$ is a sum node, then Lines 11-17
create a hidden variable and then connect the hidden
variable to all observable variables that appear in the
sub-SPN rooted at $v$, which is clearly bounded by the number
of all observable variables, $N$. So the total cost of Alg. 3
is bounded by
$\sum_v 2 \cdot \text{out-degree}(v) + \sum_{v \text{ is a sum node}} N \le 2\mathcal{V}(S) + 2\mathcal{E}(S) + N\mathcal{V}(S) \le 2|S| + N|S| = O(N|S|)$.
Note that we assume that inserting an element into a set can
be done in $O(1)$ by using hashing.

The analysis for Alg. 4 and 5 follows from the same analysis
as in the proof for Thm. 13. The time complexity
for Alg. 4 and Alg. 5 is then bounded by $N|S| + |S| = O(N|S|)$. □

Proof of Thm. 1. The combination of Thm. 12, 13 and 14 
proves Thm. 1. □ 

Proof of Corollary 2. Given a complete and consistent
SPN $S$, we can first transform it into a normal SPN $S'$ with
$|S'| = O(|S|^2)$ by Thm. 4 if it is not normal. After this the
analysis follows from Thm. 1. □

4.3. BN to SPN 

It is known that a BN with CPDs represented by tables can
be converted into an SPN by first converting the BN into a
junction tree and then translating the junction tree into an
SPN. The size of the generated SPN, however, will be
exponential in the tree-width of the original BN since the
tabular representation of CPDs is ignorant of CSI. As a result,
the generated SPN loses its power to compactly represent
some BNs that have high tree-width but exhibit CSI in their
local CPDs.

Alternatively, one can also compile a BN with ADDs into
an AC (Chavira & Darwiche, 2007) and then convert the
AC into an SPN (Rooshenas & Lowd, 2014). However, in
the compilation approach of Chavira & Darwiche (2007), the
variables appearing along a path from the root to a leaf
in each ADD must be consistent with a pre-defined global
variable ordering. The global variable ordering may, to
some extent, restrict the compactness of ADDs, as the most
compact representations for different ADDs normally have
different topological orderings. Interested readers are
referred to (Chavira & Darwiche, 2007) for more details on
this topic.

In this section, we focus on BNs with ADDs that are
constructed using Alg. 4 and 5 from normal SPNs. We show
that when applying VE to those BNs with ADDs we can
recover the original normal SPNs. The key insight is that
the structure of the original normal SPN naturally defines a
global variable ordering that is consistent with the topological
ordering of every ADD constructed. More specifically,
since all the ADDs constructed using Alg. 4 are induced
sub-SPNs with contraction of product nodes from the original
SPN $S$, the topological ordering of all the nodes in $S$
can be used as the pre-defined variable ordering for all the
ADDs.


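Algorithm 6 Multiplication of two symbolic ADDs, $\otimes$
Input: Symbolic ADDs $\mathcal{A}_{X_1}$, $\mathcal{A}_{X_2}$
Output: Symbolic ADD $\mathcal{A}_{X_1,X_2} = \mathcal{A}_{X_1} \otimes \mathcal{A}_{X_2}$
1: $R_1 \leftarrow$ root of $\mathcal{A}_{X_1}$, $R_2 \leftarrow$ root of $\mathcal{A}_{X_2}$
2: if $R_1$ and $R_2$ are both variable nodes then
3:   if $R_1 = R_2$ then
4:     Create a node $R = R_1$ into $\mathcal{A}_{X_1,X_2}$
5:     for each $r \in dom(R)$ do
6:       $\mathcal{A}^r_{X_1} \leftarrow Ch(R_1)|_r$
7:       $\mathcal{A}^r_{X_2} \leftarrow Ch(R_2)|_r$
8:       $\mathcal{A}^r_{X_1,X_2} \leftarrow \mathcal{A}^r_{X_1} \otimes \mathcal{A}^r_{X_2}$
9:       Link $\mathcal{A}^r_{X_1,X_2}$ to the $r$th child of $R$ in $\mathcal{A}_{X_1,X_2}$
10:    end for
11:  else
12:    $\mathcal{A}_{X_1,X_2} \leftarrow$ create a symbolic node $\otimes$
13:    Link $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ as two children of $\otimes$
14:  end if
15: else if $R_1$ is a variable node and $R_2$ is $\otimes$ then
16:  if $R_1$ appears as a child of $R_2$ then
17:    $\mathcal{A}_{X_1,X_2} \leftarrow \mathcal{A}_{X_2}$
18:    $\mathcal{A}^{R_1}_{X_1,X_2} \leftarrow \mathcal{A}_{X_1} \otimes \mathcal{A}^{R_1}_{X_2}$
19:  else
20:    Link $\mathcal{A}_{X_1}$ as a new child of $R_2$
21:    $\mathcal{A}_{X_1,X_2} \leftarrow \mathcal{A}_{X_2}$
22:  end if
23: else if $R_1$ is $\otimes$ and $R_2$ is a variable node then
24:  if $R_2$ appears as a child of $R_1$ then
25:    $\mathcal{A}_{X_1,X_2} \leftarrow \mathcal{A}_{X_1}$
26:    $\mathcal{A}^{R_2}_{X_1,X_2} \leftarrow \mathcal{A}_{X_2} \otimes \mathcal{A}^{R_2}_{X_1}$
27:  else
28:    Link $\mathcal{A}_{X_2}$ as a new child of $R_1$
29:    $\mathcal{A}_{X_1,X_2} \leftarrow \mathcal{A}_{X_1}$
30:  end if
31: else
32:  $\mathcal{A}_{X_1,X_2} \leftarrow$ create a symbolic node $\otimes$
33:  Link $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ as two children of $\otimes$
34: end if
35: Merge connected product nodes in $\mathcal{A}_{X_1,X_2}$
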

In order to apply VE to a BN with ADDs, we need to
show how to apply two common operations used in VE,
i.e., multiplication of two factors and summing-out a hidden
variable, on ADDs. For our purpose, we use a symbolic
ADD as an intermediate representation during the inference
process of VE by allowing symbolic operations, such as
$+, -, \times, /$, to appear as internal nodes in ADDs. In this
sense, an ADD can be viewed as a special type of symbolic
ADD where all the internal nodes are variables. The same
trick was applied by Chavira & Darwiche (2007) in their
compilation approach. For example, given symbolic ADDs
$\mathcal{A}_{X_1}$ over $X_1$ and $\mathcal{A}_{X_2}$ over $X_2$, Alg. 6 returns a symbolic
ADD $\mathcal{A}_{X_1,X_2}$ over $X_1, X_2$ such that
$\mathcal{A}_{X_1,X_2}(x_1, x_2) = (\mathcal{A}_{X_1} \otimes \mathcal{A}_{X_2})(x_1, x_2) = \mathcal{A}_{X_1}(x_1) \times \mathcal{A}_{X_2}(x_2)$.
To simplify the presentation, we choose the inverse topological
ordering of the hidden variables in the original SPN $S$ as
the elimination order used in VE. This helps to avoid the
situations where a multiplication is applied to a sum node
in symbolic ADDs. Other elimination orders could be used,
but a more detailed discussion of sum nodes is needed.

Algorithm 7 Summing-out a hidden variable $H$ from $\mathcal{A}$
using $\mathcal{A}_H$, $\oplus$
Input: Symbolic ADDs $\mathcal{A}$ and $\mathcal{A}_H$
Output: Symbolic ADD with $H$ summed out
1: if $H$ appears in $\mathcal{A}$ then
2:   Label each edge emanating from $H$ with weights obtained from $\mathcal{A}_H$
3:   Replace $H$ by a symbolic $\oplus$ node
4: end if

Given two symbolic ADDs $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$, Alg. 6
recursively visits nodes in $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ simultaneously. In
general, there are 3 cases: 1) the roots of $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ are
both variable nodes (Lines 2-14); 2) one of the two roots is
a variable node and the other is a product node (Lines 15-30);
3) both roots are product nodes or at least one of them
is a sum node (Lines 31-34). We discuss these 3 cases.

If both roots of $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ are variable nodes, there are
two subcases to be considered. First, if they are nodes
labeled with the same variable (Lines 3-10), then the
computation related to the common variable is shared and the
multiplication is recursively applied to all the children;
otherwise we simply create a symbolic product node $\otimes$ and link
$\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ as its two children (Lines 11-14). Once we
find $R_1 \in \mathcal{A}_{X_1}$ and $R_2 \in \mathcal{A}_{X_2}$ such that $R_1 \ne R_2$, there
will be no common node that is shared by the sub-ADDs
rooted at $R_1$ and $R_2$. To see this, note that Alg. 6
recursively calls itself as long as the roots of $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ are
labeled with the same variable. Let $R$ be the last variable
shared by the roots of $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ in Alg. 6. Then $R_1$ and
$R_2$ must be the children of $R$ in the original SPN $S$. Since
$R_1$ does not appear in $\mathcal{A}_{X_2}$, we have $X_2 \notin scope(R_1)$;
otherwise $R_1$ would occur in $\mathcal{A}_{X_2}$ and $R_1$ would be a new shared
variable below $R$, which is a contradiction to the fact that
$R$ is the last shared variable. Since $R_1$ is the root of the
sub-ADD of $\mathcal{A}_{X_1}$ below $R$, no variable whose
scope contains $X_2$ will occur as a descendant of $R_1$;
otherwise the scope of $R_1$ would also contain $X_2$, which is again
a contradiction. On the other hand, each node appearing in
$\mathcal{A}_{X_2}$ corresponds to a variable whose scope intersects with
$\{X_2\}$ in the original SPN, hence no node in $\mathcal{A}_{X_2}$ will
appear in $\mathcal{A}_{X_1}$. The same analysis also applies to $R_2$. Hence
no node will be shared between $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$.

If one of the two roots, say, $R_1$, is a variable node and the
other root, say, $R_2$, is a product node, then we consider two
subcases. If $R_1$ appears as a child of $R_2$ then we recursively
multiply $R_1$ with the child of $R_2$ that is labeled with the
same variable as $R_1$ (Lines 16-18). If $R_1$ does not appear
as a child of $R_2$, then we link the ADD rooted at $R_1$ to be a
new child of the product node $R_2$ (Lines 19-22). Again, let
$R$ be the last shared node between $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ during the
multiplication process. Then both $R_1$ and $R_2$ are children
of $R$, which corresponds to a sum node in the original SPN
$S$. Furthermore, both $R_1$ and $R_2$ lie in the same branch
of $R$ in $S$. In this case, since $scope(R_1) \subseteq scope(R)$,
$scope(R_1)$ must be a strict subset of $scope(R)$; otherwise
we would have $scope(R_1) = scope(R)$ and $R_1$ would also
appear in $\mathcal{A}_{X_2}$, which contradicts the fact that $R$ is the last
shared node between $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$. Hence here we only
need to discuss the two cases where either their scopes are
disjoint (Lines 16-18) or the scope of one root is a strict subset
of another (Lines 19-22).

If the two roots are both product nodes or at least one of
them is a sum node, then we simply create a new product
node and link $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ to be children of the product
node. The above analysis also applies here since sum nodes
in a symbolic ADD are created by summing out processed
variable nodes and we eliminate all the hidden variables
using the inverse topological ordering.

The last step in Alg. 6 (Line 35) simplifies the symbolic
ADD by merging all the connected product nodes without
changing the function it encodes. This can be done in
the following way: suppose $\otimes_1$ and $\otimes_2$ are two connected
product nodes in symbolic ADD $\mathcal{A}$ where $\otimes_1$ is the parent
of $\otimes_2$; then we can remove the link between $\otimes_1$ and
$\otimes_2$ and connect $\otimes_1$ to every child of $\otimes_2$. It is easy to
verify that such an operation will remove links between
connected product nodes while keeping the encoded function
unchanged.
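The merging step can be sketched as a simple recursive flattening. This is a minimal sketch under assumptions: the nested-tuple encoding (`("leaf", name)`, `("sum", [(weight, child), ...])`, `("prod", [children])`) is hypothetical, and the sketch treats the symbolic ADD as a tree, so a DAG with shared sub-ADDs would additionally need caching.

```python
def merge_products(node):
    """Flatten chains of directly connected product nodes (tree-shaped input)."""
    kind = node[0]
    if kind == "leaf":
        return node
    if kind == "sum":
        return ("sum", [(w, merge_products(c)) for w, c in node[1]])
    merged = []
    for child in node[1]:
        child = merge_products(child)
        if child[0] == "prod":
            merged.extend(child[1])  # connect the parent to the grandchildren
        else:
            merged.append(child)
    return ("prod", merged)


def value(node, env):
    """Evaluate the encoded function, reading leaf values from `env`."""
    if node[0] == "leaf":
        return env[node[1]]
    if node[0] == "sum":
        return sum(w * value(c, env) for w, c in node[1])
    out = 1.0
    for c in node[1]:
        out *= value(c, env)
    return out


nested = ("prod", [("leaf", "a"),
                   ("prod", [("leaf", "b"),
                             ("prod", [("leaf", "c"), ("leaf", "d")])])])
flat = merge_products(nested)
env = {"a": 2.0, "b": 3.0, "c": 5.0, "d": 7.0}
```

Because multiplication is associative, `value(nested, env)` and `value(flat, env)` coincide, while `flat` has a single product node with four children.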

To sum-out one hidden variable $H$, Alg. 7 simply replaces
$H$ in $\mathcal{A}$ by a symbolic sum node $\oplus$ and labels each edge of
$\oplus$ with weights obtained from $\mathcal{A}_H$.
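Summing out a hidden variable can be sketched in the same spirit. The sketch below is illustrative only and rests on assumptions: symbolic ADD nodes are encoded as hypothetical tuples (`("decision", name, branches)`, `("sum", ((weight, child), ...))`, `("prod", children)`, `("stump", var, dist)`), the weights are passed in directly rather than read from $\mathcal{A}_H$, and the input is treated as a tree.

```python
def sum_out(add, hidden, weights):
    """Replace every decision node on `hidden` by a weighted symbolic sum node."""
    kind = add[0]
    if kind == "stump":
        return add
    if kind == "decision":
        _, name, branches = add
        new_branches = tuple(sum_out(b, hidden, weights) for b in branches)
        if name == hidden:
            # Label each outgoing edge with its weight and turn H into a sum.
            return ("sum", tuple(zip(weights, new_branches)))
        return ("decision", name, new_branches)
    if kind == "sum":
        return ("sum", tuple((w, sum_out(c, hidden, weights)) for w, c in add[1]))
    return ("prod", tuple(sum_out(c, hidden, weights) for c in add[1]))


def evaluate(add, x):
    """Evaluate a symbolic ADD in which all hidden variables are summed out."""
    kind = add[0]
    if kind == "stump":
        return add[2][x[add[1]]]
    if kind == "sum":
        return sum(w * evaluate(c, x) for w, c in add[1])
    if kind == "prod":
        out = 1.0
        for c in add[1]:
            out *= evaluate(c, x)
        return out
    raise ValueError("decision nodes must be summed out before evaluation")


# Product of the ADDs of X1 and X2, sharing the decision node on H.
product_add = ("decision", "H", (
    ("prod", (("stump", "X1", {1: 0.2, 0: 0.8}), ("stump", "X2", {1: 0.9, 0: 0.1}))),
    ("prod", (("stump", "X1", {1: 0.7, 0: 0.3}), ("stump", "X2", {1: 0.5, 0: 0.5}))),
))
spn = sum_out(product_add, "H", (0.4, 0.6))
p11 = evaluate(spn, {"X1": 1, "X2": 1})
```

After the replacement the structure contains only sum nodes, product nodes and terminal distributions, i.e., it can be read as an SPN; here `p11` is $0.4 \cdot 0.2 \cdot 0.9 + 0.6 \cdot 0.7 \cdot 0.5$.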

We now present the Variable Elimination (VE) algorithm
in Alg. 8 used to recover the original SPN $S$, taking Alg. 6
and Alg. 7 as two operations $\otimes$ and $\oplus$ respectively.

In each iteration of Alg. 8, we select one hidden variable
$H$ in ordering $\pi$, multiply all the ADDs $\mathcal{A}_X$ in which $H$
appears using Alg. 6 and then sum-out $H$ using Alg. 7. The
algorithm keeps going until all the hidden variables have
been summed out and there is only one symbolic ADD left
in $\Phi$. The final symbolic ADD gives us the SPN $S$ which
can be used to build BN $B$. Note that the SPN returned
by Alg. 8 may not be literally equal to the original SPN
since during the multiplication of two symbolic ADDs we
effectively remove redundant nodes by merging connected
product nodes. Hence, the SPN returned by Alg. 8 could
have a smaller size while representing the same probability
distribution. An example is given in Fig. 6 to illustrate the
recovery process. The BN in Fig. 6 is the one constructed
in Fig. 5.

Algorithm 8 Variable Elimination for BN with ADDs
Input: BN $B$ with ADDs for all observable variables and
hidden variables
Output: Original SPN $S$
1: $\pi \leftarrow$ the inverse topological ordering of all the hidden
variables present in the ADDs
2: $\Phi \leftarrow \{\mathcal{A}_X \mid X \text{ is an observable variable}\}$
3: for each hidden variable $H$ in $\pi$ do
4:   $P \leftarrow \{\mathcal{A}_X \mid H \text{ appears in } \mathcal{A}_X\}$
5:   $\Phi \leftarrow \Phi \setminus P \cup \{\oplus_H \otimes_{\mathcal{A} \in P} \mathcal{A}\}$
6: end for
7: return $\Phi$

Note that Alg. 6 and 7 apply only to ADDs constructed 
from normal SPNs by Alg. 4 and 5 because such ADDs nat¬ 
urally inherit the topological ordering of sum nodes (hidden 
variables) in the original SPN S. Otherwise we need to pre¬ 
define a global variable ordering of all the sum nodes and 
then arrange each ADD such that its topological ordering 
is consistent with the pre-defined ordering. Note also that 
Alg. 6 and 7 should be implemented with caching of re¬ 
peated operations in order to ensure that directed acyclic 
graphs are preserved. Alg. 8 suggests that an SPN can also 
be viewed as a history record or caching of the sums and 
products computed during inference when applied to the 
resulting BN with ADDs. 

We now bound the run time of Alg. 8. 

Theorem 15. Alg. 8 builds SPN $S$ from BN $B$ with ADDs
in $O(N|S|)$.

Proof. First, it is easy to verify that Alg. 6 takes at most
$|\mathcal{A}_{X_1}| + |\mathcal{A}_{X_2}|$ operations to compute the multiplication of
$\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$. More importantly, the size of the generated
$\mathcal{A}_{X_1,X_2}$ is also bounded by $|S|$. This is because all
the common nodes and edges in $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ are shared
(not duplicated) in $\mathcal{A}_{X_1,X_2}$. Also, all the other nodes and
edges which are not shared between $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ will be
in two branches of a product node in $S$; otherwise they would
be shared by $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$, as they would have the same scope
which contains both $X_1$ and $X_2$. This means that $\mathcal{A}_{X_1,X_2}$
can be viewed as a sub-SPN of $S$ induced by the node set
$\{X_1, X_2\}$ with some product nodes contracted out. So we
have $|\mathcal{A}_{X_1,X_2}| \le |S|$.

Figure 6. Multiply $\mathcal{A}_{X_1}$ and $\mathcal{A}_{X_2}$ that contain $H$ using Alg. 6 and then sum out $H$ by applying Alg. 7. The final SPN is isomorphic with
the SPN in Fig. 5.

Now consider the for loop (Lines 3-6) in Alg. 8. The loop
ends once we have summed out all the hidden variables and
there is only one ADD left. Note that there may be only one
ADD in $\Phi$ during some intermediate steps, in which case
we do not have to do any multiplication. In such steps, we
only need to perform the sum out procedure without
multiplying ADDs. Since there are $N$ ADDs at the beginning
of the loop and after the loop we only have one ADD,
there are exactly $N - 1$ multiplications during the for loop,
which cost at most $(N - 1)|S|$ operations. Furthermore,
in each iteration there is exactly one hidden variable
being summed out. So the total cost for summing out all the
hidden variables in Lines 3-6 is bounded by $|S|$.

Overall, the operations in Alg. 8 are bounded by
$(N - 1)|S| + |S| = O(N|S|)$. □

Proof of Thm. 3. Thm. 15 and the analysis above prove 
Thm. 3. □ 

5. Discussion 

Thm. 1 together with Thm. 3 establishes a relationship
between BNs and SPNs: SPNs are no more powerful than
BNs with ADD representation. Informally, a model is
considered to be more powerful than another if there exists a
distribution that can be encoded in polynomial size in some
input parameter $N$, while the other model requires
exponential size in $N$ to represent the same distribution. The
key is to recognize that the CSI encoded by the structure
of an SPN, as stated in Proposition 21, can also be encoded
explicitly with ADDs in a BN. We can also view an SPN
as an inference machine that efficiently records the history
of the inference process when applied to a BN. Based on
this perspective, an SPN is actually storing the calculations
to be performed (sums and products), which allows online
inference queries to be answered quickly. The same idea
also exists in other fields, including propositional logic
(d-DNNF) and knowledge compilation (AC).
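
To make the "stored calculation" view concrete, the following is a minimal sketch of bottom-up SPN evaluation. It is not the construction used in the proofs: the class names, the use of univariate Bernoulli leaves in place of indicator leaves, and the example weights are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Leaf:
    var: str
    p_true: float  # P(var = True) under this univariate leaf


@dataclass(frozen=True)
class Product:
    children: Tuple[object, ...]


@dataclass(frozen=True)
class Sum:
    weighted_children: Tuple[Tuple[float, object], ...]


def evaluate(root, evidence: Dict[str, bool]) -> float:
    """Bottom-up pass; each node is computed once per query."""
    cache = {}

    def visit(node):
        key = id(node)
        if key in cache:
            return cache[key]
        if isinstance(node, Leaf):
            if node.var not in evidence:
                value = 1.0  # an unobserved variable is marginalized out
            else:
                value = node.p_true if evidence[node.var] else 1.0 - node.p_true
        elif isinstance(node, Product):
            value = 1.0
            for child in node.children:
                value *= visit(child)
        else:  # Sum node
            value = sum(w * visit(child) for w, child in node.weighted_children)
        cache[key] = value
        return value

    return visit(root)


# Hypothetical two-variable SPN: a mixture of two fully factorized components.
component_a = Product((Leaf("X1", 0.9), Leaf("X2", 0.2)))
component_b = Product((Leaf("X1", 0.1), Leaf("X2", 0.7)))
spn = Sum(((0.6, component_a), (0.4, component_b)))

joint = evaluate(spn, {"X1": True, "X2": True})  # P(X1 = T, X2 = T)
marginal = evaluate(spn, {"X1": True})           # P(X1 = T)
```

The same fixed structure of sums and products answers both the joint and the marginal query; only the leaf values change, which is what allows online queries to be served by a single pass over the network.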

The constructed BN has a simple bipartite structure, no
matter how deep the original SPN is. However, we can
relate the depth of an SPN to a lower bound on the
tree-width of the corresponding BN obtained by our algorithm.
Without loss of generality, assume that product layers
alternate with sum layers in the SPN we are considering.
Let the height of the SPN, i.e., the length of the longest path
from the root to a terminal node, be $K$. By our assumption, there
will be at least $\lfloor K/2 \rfloor$ sum nodes in the longest path.
Accordingly, in the BN constructed by Alg. 3, the observable
variable corresponding to the terminal node in the longest
path will have in-degree at least $\lfloor K/2 \rfloor$. Hence, after
moralizing the BN into an undirected graph, the clique-size of
the moral graph is bounded below by $\lfloor K/2 \rfloor + 1$. Note
that for any undirected graph the clique-size minus 1 is
always a lower bound of the tree-width. We then reach the
conclusion that the tree-width of the constructed BN has
a lower bound of $\lfloor K/2 \rfloor$. In other words, the deeper the
SPN, the larger the tree-width of the BN constructed by our
algorithm and the more complex are the probability
distributions that can be encoded. This observation is consistent
with the conclusion drawn in (Delalleau & Bengio, 2011),
where the authors prove that there exist families of
distributions that can be represented much more efficiently with
a deep SPN than with a shallow one, i.e., with substantially
fewer hidden internal sum nodes. Note that we only give a
proof that there exists an algorithm that can convert an SPN
into a BN without any exponential blow-up. There may
exist other techniques to convert an SPN into a BN with a
more compact representation and also a smaller tree-width.
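
The moralization step in this argument can be illustrated with a small sketch. The helper names `moralize`, `is_clique`, and `treewidth_lower_bound`, and the example with height $K = 6$, are hypothetical; the sketch only checks that an observable variable with $\lfloor K/2 \rfloor$ hidden parents yields a clique of size $\lfloor K/2 \rfloor + 1$ in the moral graph.

```python
from itertools import combinations


def moralize(parents):
    """Return the undirected edge set of the moral graph.

    parents: dict mapping each child to the list of its parents.
    Every child is linked to its parents, and co-parents are "married".
    """
    edges = set()
    for child, child_parents in parents.items():
        for parent in child_parents:
            edges.add(frozenset((child, parent)))
        for a, b in combinations(child_parents, 2):
            edges.add(frozenset((a, b)))
    return edges


def is_clique(nodes, edges):
    """Check that every pair of the given nodes is connected."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))


def treewidth_lower_bound(height):
    """Lower bound floor(K/2) for an SPN of height K with alternating layers."""
    return height // 2


# Hypothetical example: height K = 6, so at least 3 sum nodes lie on the
# longest path and the observable variable X has 3 hidden parents.
K = 6
bn_parents = {"X": ["H1", "H2", "H3"]}
moral_edges = moralize(bn_parents)
clique_found = is_clique(["X", "H1", "H2", "H3"], moral_edges)
bound = treewidth_lower_bound(K)
```

Here the four variables form a clique in the moral graph, so the tree-width is at least 3, in agreement with the bound for $K = 6$.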

High tree-width is usually used to indicate a high inference
complexity, but this is not always true as there may exist
lots of CSI between variables, which can reduce inference
complexity. CSI is precisely what enables SPNs and BNs
with ADDs to compactly represent and tractably perform
inference in distributions with high tree-width. In
contrast, in a Restricted Boltzmann Machine, which is an
undirected bipartite Markov network, CSI may not be present
or not exploited, which is why practitioners have to
resort to approximate algorithms, such as contrastive
divergence (Carreira-Perpinan & Hinton, 2005). Similarly,
approximate inference is required in bipartite diagnostic BNs
such as the Quick Medical Reference network (Shwe et al.,
1991) since causal independence is insufficient to reduce
the complexity, while CSI is not present or not exploited.
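
As a generic illustration of how CSI shrinks a representation, consider a conditional distribution whose parents are tested in order, so that later parents become irrelevant once an earlier parent takes the value 1. This is not the ADD construction of Alg. 3; the function names and probabilities below are hypothetical and serve only to compare a decision-diagram-style encoding with a full table.

```python
from itertools import product


def chain_cpd(leaf_probs):
    """Return a lookup for P(X = 1 | h) with chain-structured CSI.

    leaf_probs has k + 1 entries for k binary parents: entry i is used when
    parent i is the first parent equal to 1, and the last entry is used when
    all parents are 0. Once a parent equals 1, the remaining parents are
    irrelevant in that context.
    """
    k = len(leaf_probs) - 1

    def lookup(h):
        for i in range(k):
            if h[i] == 1:
                return leaf_probs[i]
        return leaf_probs[k]

    return lookup


def tabulate(lookup, k):
    """Expand the compact lookup into a full table over all 2**k contexts."""
    return {h: lookup(h) for h in product((0, 1), repeat=k)}


# Hypothetical example with k = 10 parents and 11 distinct parameters.
k = 10
leaf_probs = [i / 20 for i in range(k + 1)]
cpd = chain_cpd(leaf_probs)
table = tabulate(cpd, k)
distinct_values = len(set(table.values()))
```

The full table has $2^{10} = 1024$ rows, while the chain-structured encoding needs only 11 parameters, which is the kind of saving that a tabular representation cannot express.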

6. Conclusion

In this paper, we establish a precise connection between
BNs and SPNs by providing a constructive algorithm to
transform between these two models. To simplify the
proof, we introduce the notion of normal SPN and describe
the relationship between consistency and decomposability
in SPNs. We analyze the impact of the depth of SPNs on
the tree-width of the corresponding BNs. Our work also
provides a new direction for future research about SPNs
and BNs. Structure and parameter learning algorithms for
SPNs can now be used to indirectly learn BNs with ADDs.
In the resulting BNs, correlations are not expressed by links
directly between observed variables, but rather through
hidden variables that are ancestors of correlated observed
variables. The structure of the resulting BNs can be used to
study probabilistic dependencies and causal relationships
between the variables of the original SPNs. It would also be
interesting to explore the opposite direction since there is
already a large literature on parameter and structure
learning for BNs. One could learn a BN from data and then
exploit CSI to convert it into an SPN.